EDITOR’S COMMENTS


Presenting the 2018 Winter issue of the Journal of the Washington Academy 
of Sciences. 


For this issue we have our first (since I can remember) letter to the 
editor. I encourage people to write letters to the editor. Please email
comments on papers, suggestions for articles, and ideas for what you
would like to see in the Journal to wasjournal@washacadsci.org.


We start with a column by one of our Board members: the 
Administrative Vice President. This is a good addition to the Journal and 
informative as well. Perhaps such columns can become a regular part of the 
Journal. 


Following that is a book review of Accessory to War: The Unspoken
Alliance between Astrophysics and the Military by Neil deGrasse
Tyson and Avis Lang.


Next up is a student paper by Lydia Chance from Frederick 
Community College. We encourage student papers and help the student to 
learn about writing a scientific paper. 


Then come two multi-author papers: one on computer search engines for
natural language documents, the other on reusable models of manufacturing
processes.


Every winter we print a list of members and addresses. Please check 
to see that you are listed correctly. The Academy covers the greater 
Washington DC area including parts of Virginia and Maryland. Most of our 
members live in Maryland. 


The Journal is the official organ of the Academy. Please consider 
sending in technical papers, review studies, announcements, and book 
reviews. 


We are a peer-reviewed journal and need volunteer reviewers. If you
would like to be on our reviewer list, please send email to the above address
and include your specialty.


Sethanne Howard 

Letters to the Editor
From: Jeff Bullard, Fellow, WAS 


I was disappointed in reading the article contributed by C. Sluzki entitled “The
Impact of Authoritarian Regimes”, in Volume 104, Issue 3, pp. 11-18. That 
article contains, among other disturbing passages, the following paragraph 
at the top of p. 14: 


These are worrisome times. Far right, ethnic-nationalist, populist, racist, 
sexist, anti-immigrants (sic), anti-abortion rights, anti-ecological, anti-free 
speech, post-facts (post-truth!), authoritarian candidates and governments 
are gaining strength world-wide. We are facing a world being 
progressively seized by charismatic leaders who may not yet be tyrants with 
a simplified polarizing discourse capable of perpetrating enormous 
evil. And, even while many of these ideologies didn't triumph electorally---
as happened in some European countries---the effect of their rise has been
that majority of the center parties have moved several inches toward social 
intolerance, as a way of capturing a portion of the electorate attracted by 
those polarizing discourses. 


I am troubled that the author introduced this kind of explicit, inflammatory,
and highly subjective political bias, which compromises the veracity of the
rest of the article. Even a cursory survey of modern world history 
demonstrates that authoritarian regimes, the ostensible subject of the paper, 
do not arise from one particular political ideology as the author asserts. Are 
there no far-left regimes that trouble the author? Are far-right governments 
the only ones that are anti-free speech or anti-ecological? Are there no far-left
regimes to be found with a tinge of authoritarianism, or is the author
simply untroubled by far-left tactics? Both here and in his earlier “even
more personal vignette” on p. 13, the author reveals a significant bias that
would seriously undermine any attempt at analysis (if there had been any
actual scientific analysis) in the article. I hope I am not the only reader who
thinks that Dr. Sluzki’s article is unfitting content for a journal committed 
to scientific discourse instead of sensationalistic political opinions. 


Washington Academy of Sciences 


Response: Carlos E. Sluzki, MD, Fellow, WAS 


I truly appreciate Dr. Bullard’s comments: Criticism is more generous and 
constructive than silence! 


Dr. Bullard is right in his first point: I could, and perhaps should, have
omitted the words “right wing” from my article (or at least added a footnote
making “also left-wing dictatorships” explicit). It might then have avoided
the assumption that, by focusing on right-wing hegemonies, I was
condoning left-wing ones. I do not. In fact, I agree that, to a greater or lesser
degree, the over-inclusive epithets “ethnic-nationalist, racist, sexist,
anti-immigrant, anti-ecological, anti-free speech, post-fact, authoritarian” may
describe traits of both ends of the political spectrum. While not justifying
my omission, I explain it by the fact that during these past few years there
hasn’t been, to my knowledge, any upsurge of left-wing political extremism
that fits those attributes (with the possible exception of the political scenario
of a couple of former USSR republics, and a few governments in the process
of collapsing, such as Venezuela). In contrast, there has been a notable
expansion of right-wing¹ populism² both in the Americas (the U.S.A.,
Brazil) and in Europe (noticeably Austria, Belgium, Denmark, France, Italy,
Norway, Switzerland,³ and in a more extreme fashion, Hungary and
Poland). Not all of these movements are in control of their country’s
government (the exceptions being the last two mentioned), but they have
grown remarkably, and dangerously (bringing once again into this discourse
an opinion, albeit one fed by the lessons of history and shared by many; see, e.g.,
Wodak, 2015).


The other issue brought forth by Dr. Bullard, namely, whether a scientific
journal should tolerate opinions, is another matter. It echoes a spurious 


1 Right-wing: Defined as an ideology that accepts and supports a system of social
hierarchy or social inequality, with a strong anti-immigrant rhetoric and, broadly
speaking, supporting curtailment of the role of the state, and supporting a neoliberal economy.
Carlisle, R.P. (2005) Encyclopedia of politics: the left and the right, Volume 2.
University of Michigan; Sage Reference. pp. 693 & 721

2 Populism is described by the Cambridge Dictionary as ‘political ideas and activities that
are intended to get the support of ordinary people by giving them what they want’. It
includes the usage label ‘mainly disapproving’.
https://www.cam.ac.uk/news/populism-revealed-as-2017-word-of-the-year-by-cambridge-university-press

3 Datasets: Austrian Legislative Election; Swiss Federal Election, 2011; Norwegian
Parliamentary Election, 2013; Belgian Federal Election, 2011; Danish General
Election, 2011. In European Election Database. Web 6 Nov. 2013. &
https://en.wikipedia.org/wiki/2018_Italian_general_election
territorial dispute about the legitimacy of the use of the label “science”
between mathematics-based “hard” and socio-behavioral (and philosophy,
and history, and...) “softer” disciplines, and between science and the
common language (see, e.g., Bertrand Russell, 1958). Should we erect tall,
beautiful walls between scientific fields, arguing the impurity or 
dangerousness of our neighbors, or assume that there are gray zones 
between provinces of the field of sciences where rigor and imagination 
combine in fuzzy ways, to everybody’s benefit? “If scientific values 
recognize a plurality of perspective, freedom of expression and political 
negotiation beyond the alliances of the powerful, they would fit with the 
values of liberal democracy. But the banner of ‘scientific values’ could 
equally be raised by an authoritarian technocracy, in which tacit and 
indigenous knowledge is marginalized.” (Hulme, 2009, p.702) 


Science is not “out there,” untouched by the values of the scientist and 
his/her times. Scientific journals can and should have values visible in their
pages.

In January of 2016, I became Vice President for Administrative Affairs
for the Washington Academy of Sciences. I spent the first year of my 
incumbency trying to learn the job and to understand how I would fit 
into the operational framework of the academy. Although I started with a 
review of the WAS Bylaws, the Academy has a longstanding annual 
business cycle and I was inserted into it at about its midpoint. So while I 
studied the Bylaws, I had to, perforce, keep the wheels turning (while I 
pumped up the tires, so to speak). 


To paraphrase Article I of the Bylaws, the two primary purposes of the
Academy are to


• promote the interests of science (small ‘s’, i.e., not the magazine,
although we are grateful to the AAAS and that magazine for
their support) in Washington D.C. and its environs, and


• provide for information sharing and cooperative activities
among the members and affiliated societies of the Academy.


Both purposes are only indirectly influenced by our current 
operational environment, as reflected in the tools and procedures (and 
the people, volunteers all) we have at our disposal for orienting the WAS 
to achieve the goals implied by those purposes. 


Furthermore, it has become clear over time that the operational 
environment is anachronistic and not particularly responsive to changes 
in the Washington science community. I’m in no position to direct or 
steer the WAS (and it’s not my job to do so), but I do hope to make the 
WAS Administration and its actions more visible to DC science in 
general as well as to our membership and affiliates. In the process, 
perhaps the WAS will become more engaged and engaging to the
science-related organizations and individuals in our local
Metropolitan Statistical Area.¹


To get started toward this goal, and as titular operations director both 
of the Academy and of the Journal of the WAS, it seemed appropriate 
that I try my hand at writing a column for the Journal. My current intention
is to do this every quarter, but there’s no telling how that intention may 
swerve over time. Other options are manifold: this space could be used 
for guest essays, or perhaps other members of the Board of Managers 
(“the BOM”) will offer contributions. Certainly, the Journal itself would 
welcome offerings from the WAS membership. Ultimately this may lead 
nowhere, but I hope not. 


My job is described in the Bylaws under Article III. To summarize 
that Article: 


The Admin VP 


• is 3rd in rank in the Board of Managers, after the President and
President-elect, and presides over Board of Managers meetings
when the President and President-elect are unavailable;


• manages the business office and is responsible for business
operations of the Academy and the Journal;


• oversees the Office Manager and Editorial Advisory Committee;


• absent someone specifically appointed to the role, acts as
Archivist to maintain the historical records of the Academy.


That’s pretty much what the Bylaws say, and the BOM tries to
follow those rules. But the world has changed since these Bylaws
were written, and there are some rather obvious problems with my list
above.


1. We don’t have an(other) office manager. For the nonce, I’m it. 


2. Similarly, we don’t have an Editorial Advisory Committee. Our
Journal Editor, Sethanne Howard, advises herself (and does a
fantastic job). Moreover, if we had an EA committee, I’m not
sure what purpose it would serve, except to give me something
else to do. However, the committee would be useful as a backup
pool of editorial assistants in the event of a resurgence of Journal
submissions.


1 https://www.bls.gov/regions/mid-atlantic/data/xg-tables/ro3 x95 12.htm for the
Bureau of Labor Statistics.


3. Academy and Journal Operations aren’t actually tied that closely 
together except through their respective finances, which are 
coordinated between me and the Treasurer. 


The Office Manager’s role is, then, primarily that of an inward-facing
office, with data management responsibilities for keeping track of the
business cycle (subscriptions and membership data) and, to a lesser
degree, the publishing cycle (e.g., receiving and retaining Journal
overprints).


Outward-facing data dissemination (and related data management)
responsibilities are carried out by several individuals. Currently, our
Journal Editor, Sethanne Howard, prepares the quarterly editions of the
Journal of the WAS and produces our email-based newsletters and
announcements. The Webmaster (a role currently filled by Paul
Arveson, who also serves as the VP of the Junior Academy) controls
our web content and administers the washacadsci.org email domain.
Finally, our social media presence is, at the moment, the responsibility
of our President-elect, Judy Staveley, with guidance and assistance from
Paul Arveson.


The Journal Editor is acknowledged in the Bylaws, but there is no
mention of the social media, email administration, or Webmaster roles.
Since all of these duties and responsibilities are generally
undocumented, the person in each role must depend upon word of mouth
(from anonymous sources, mostly old-timers and former officers of the
BOM) and ‘what feels right’. Ultimately, they must decide for 
themselves how to discharge those duties and meet their responsibilities. 


So, where is all of this heading? This year, the Washington Academy 
of Sciences is 120 years old. As it has aged, it has also evolved and must 
continue to do so. It’s old news that the worldwide adoption of 
electronic technologies means that disruptive changes to enterprise
business models have challenged organizations of all sizes and
intentions. Our business cycle (the annual cycle of the Academy) begins 
each May with the turnover of the new BOM, led by a new President 
and President-elect. Each of the other officers may remain in office
indefinitely, subject to their continuing to appear on the annual ballot. 
It’s that property of incumbency that allows the presiding officers to be 
replaced without sacrificing continuity of knowledge and understanding 
of the Academy’s processes. 


Establishing and documenting how the Academy can deal with our 
changing world is a responsibility we all share. As Admin VP I am 
responsible for coordinating the data and office management processes 
of the Academy and for projecting how those processes are documented 
and shared with the Academy membership. I must also be a collector of 
insights into the changes the Academy must undergo to remain relevant 
and useful for the DC science community. I find that the one day a week 
that I can support this office doesn’t allow much time for an Enterprise 
Architecture effort. Such an effort would, I believe, be the expected, 
contemporary strategy for an organization to address issues of 
transformation in the face of disruptive change. So, I invite all readers of 
this Journal in the DC area who have free time to travel downtown to
correspond with me about supporting either the Office Management or 
Enterprise Architecture efforts. I welcome any suggestions from anyone 
as to how best to deal with the situation I’ve described (or provide a 
better understanding of the current status of the WAS). My email is: 


admin@washacadsci.org


The preceding summary has focused on the current work of the 
Administration function as it relates to Office Management functions. 
I’ve not said anything about the Archive responsibilities. As a member 
of an ISO committee responsible for Digital Archive standards (ISO 
14721, and related ISO 16363 and 16919) it’s embarrassing that I’ve not 
spent more time on this aspect of the Admin job. My only excuse is that 
my ISO focus area (Digital Archives) doesn’t really include the WAS 
archives, which are mostly hardcopy. However, if anyone out there has 
access to a system to convert paper documents to PDF files, I’d like to 
talk to you, too. 
Over the course of the next year, I plan to write more about how
the Admin functions are executed and how they support and complement 
the activities of the WAS. 


Terry Longstreth (AKA Wallace Isaac Longstreth, III) 

Book Review


Accessory to War: The Unspoken Alliance between 
Astrophysics and the Military 


Authors: 
Neil deGrasse Tyson and Avis Lang
Norton, 2018. ISBN 978-0-393-06444-5


The popular conception of astronomy and astrophysics is as an “ivory
tower” pursuit. Furthering the understanding of the universe and the processes in
it is considered exploration for its own sake. The authors show that this
is not the case at all. Here is a summary of some areas that are described
in much more detail in the book.


By going back to the pretelescopic era of naked eye astronomy, the 
book describes how, in the royal courts of Middle Ages Europe, the 
astronomer was also an astrologer. Horoscopes were cast to determine the 
proper dates and times for multiple activities, including war. Accurate 
predictions of the planets, Sun, and Moon were crucial in casting 
horoscopes. The development of improved planetary predictions by famous 
astronomers such as Copernicus, Tycho Brahe, and Kepler had at their root 
the practical motivation to cast better horoscopes. This role independently 
originated in separate ancient civilizations such as China, India, and even 
the Mayan civilization of Central America. Astronomers separated 
themselves from astrology as it became clear that the stars and planets were 
so far away as to have little effect on Earth-bound life. One quibble this
reviewer has is that the physical reason for this abandonment was not clearly
explained in the book. The Sun and Moon are exceptions to this lack of effect,
through the tides and the seasons, neither of which is astrological. One
characteristic of science is “reproducibility,” which astrology fails to show:
Venus in European astrology was the goddess of love, while the Mayans thought
of it as a terrible god of war, a completely different reading.


Beyond astrology, astronomy played a crucial role in the age of 
colonial empires, from the Renaissance to the 20th century. Columbus’
application of the discovery of the spherical shape of the Earth to the 
conquest of the New World is well described in the book. But Columbus 
was not the first. The spherical shape was well known to the ancient Greeks 
and used in determination of latitude angle from the equator even by the 
Vikings via angles of stars and the sun above the horizon. The hard problem
was determination of longitude, solved in the 18th century by the
development of accurate ship-board clocks. Local time found by
astronomical observations was compared to the time at, say, Greenwich,
England, as preserved by the clock. It was now possible to discover the
locations of new lands to settle or conquer and return home afterward. Soon 
there were observatories in every major port to study the stars and, more 
importantly, accurately determine time. Perhaps this gave astronomers 
alternative employment to casting horoscopes! The book has an excellent 
description of the other indispensable tool of sailors, the compass. World- 
wide observations of deviations of the compass from true north were made 
by astronomers such as Edmund Halley. The book makes clear that for 
better or worse (trade of new products or the slave trade), astronomy played 
a crucial role in the age of sea-based empires. 
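
The time comparison described above reduces to simple arithmetic: the Earth turns 360 degrees in 24 hours, so each hour of difference between local time and Greenwich time corresponds to 15 degrees of longitude. The short Python sketch below illustrates this; the function name and the sample times are illustrative assumptions and are not taken from the book.

```python
def longitude_from_times(local_hours: float, greenwich_hours: float) -> float:
    """Return longitude in degrees (east positive, west negative).

    local_hours: local solar time found from astronomical observation.
    greenwich_hours: Greenwich time at the same instant, read from the clock.
    """
    degrees_per_hour = 360.0 / 24.0  # the Earth turns 15 degrees per hour
    diff = local_hours - greenwich_hours
    # Wrap the difference into [-12, 12) hours so the result lies in [-180, 180).
    diff = (diff + 12.0) % 24.0 - 12.0
    return diff * degrees_per_hour


# Hypothetical reading: local noon occurs when the ship's clock shows 15:00.
print(longitude_from_times(12.0, 15.0))  # -45.0, i.e. 45 degrees west
```

A clock error of four minutes would shift the result by one degree, which is why accurate ship-board clocks were the key to the method.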


Then turning to the telescope, one thinks of great astronomical 
discoveries such as Galileo’s first great observations of craters on the Moon 
or the satellites of Jupiter. However, this book makes a detailed case that 
the telescope was seen as an instrument of war from the very start.
Galileo himself promoted its use to identify, for example, distant enemy
ships. Beyond simple optical telescopes, a crescendo is described of
refinements resulting in instruments today that would be unrecognizable as
telescopes to an old-fashioned optical astronomer. Today telescopes use not
only the invisible electromagnetic wavelengths of UV, infrared, and radio, 
but even gravitational waves from merging black holes and neutrinos from 
the cores of active galaxies. 


The latter part of the book is a very detailed description of what I 
shall call the modern-day weaponization of space. Expenditures for these
hidden activities are very large compared to the much better known 
scientific explorations. Thankfully, the 1963 Limited Nuclear Test Ban 
Treaty has led to the exclusion of nuclear weapons in space thus far. Related 
to the Test Ban Treaty, in an interesting transfer from the military to 
astronomy, a military satellite detected gamma rays thought to be from 
treaty-breaking nuclear tests. There was great concern until astronomers
were able to verify that the rays were not from the Earth but from other 
galaxies. Thus was born a new area of astronomy. 
Despite the Test Ban Treaty, other types of non-nuclear hostile
weapons such as hunter-killer satellites have been tested by the United
States and China and are now being developed by other countries. The
recent talk of a United States space force is merely a combination under one
command of many already existing efforts. Today, a worrisome impact
threat to orbiting satellites is “space junk”: debris from exploded satellites
or other space activities. We are now so dependent on satellites, for everything
from communication to GPS navigation, that a flood of debris and attacks from
even a non-nuclear space war would, in the words of the book’s authors,
“be terrifying.” Worry about these terrible effects has resulted in a
stalemate of sorts. In addition to diplomacy, the authors hope that education 
and better scientific understanding may avert a terrible future. 


As a counterpoint to this book’s theme, recent astronomical
research has revealed beauty and a story which the general public seems to
value for itself with no military benefit. An example of the beauty revealed
is the famous “Pillars of Creation” photograph by the Hubble Space
Telescope of glowing gas clouds and forming stars, beautiful even to those
who do not know what is happening in the photo. As a result of such images,
a successful campaign was launched to keep the Space Telescope in
operation. Another scientific trend of no foreseeable military benefit is the
revelation that there is a story connecting us personally to a sequence of
events reaching to the origin of the universe. For example, the iron in our 
blood was created in exploding supernovas, and the hydrogen in the water 
in our blood was created in the Big Bang itself. 


In a final note, this book is very detailed with footnotes making up 
a significant portion of the book. Probably it should be read in smaller 
chapter-by-chapter doses rather than straight through. There is a trap (into 
which this reviewer has fallen personally) of having extensive knowledge 
of a subject which is all presented in an overwhelming manner. Although 
one of the co-authors is an editor, this book needed a good editor to create 
a version emphasizing the most important facets in a more digestible form 
for the lay reader. I would recommend this book to a layperson who is 
already well-read in astronomy or space science.


Gene G. Byrd, Professor Emeritus, University of Alabama 

Analyzing the Accuracy and Effectiveness of the EVI-PAQ
Trajectory Laser to Determine Bullet Trajectory
Reconstruction


Lydia Chance 


Frederick Community College 


Abstract 


By following the written protocol on the use of a trajectory laser pointer, 
we weighed its benefits when applied to a crime scene investigation. We
followed the standard protocol listed in the EVI-PAQ Trajectory kit and 
demonstrated finding the angles of trajectory of blood droplets and/or 
bullet holes. There are several methods to find an angle of trajectory, such 
as using protrusion rods or a string alongside a protractor. Evidence 
gathered is only as useful as the photographs taken to document it. 
Therefore, ensuring that all photos taken are clear is a necessity. This 
experiment in recreating a crime scene emphasizes the usefulness of the
trajectory laser and provides an in-depth review of its use in criminal
justice settings.


Introduction 


ALTHOUGH THERE ARE ACCURATE WAYS to reconstruct the pathway a 
bullet took through the air upon firing, I demonstrate the accuracy and easy 
maneuverability of modern equipment such as a trajectory laser over the use 
of string and protrusion rods alone. Implementing this modern method of 
visualization in a crime scene recreation provides an invaluable experience 
by showing the precision of the current techniques being used by today’s crime
scene investigators. Using the EVI-PAQ Trajectory Kit, one can
determine the point of origin of a shooter based on the angle of a bullet hole. 
This laser kit contains several methods to determine the angle at which a 
bullet struck a surface. 


The reconstruction of bullet trajectory is often the last step in 
recreating a crime scene, but that does not make it any less important than 
collecting other forms of evidence. There are crime scene investigators who 
work specifically in ballistics and specialize in interpreting the data 
gathered by the trajectories and then speculating where the shot originated. 
Often, the bullet trajectory will tell where a shooter was standing, and it is 
reliable within the first 50 yards of travel for the bullet without having to
account for other variables such as gravity, air resistance, and yaw. 


This evaluation consists of three phases: setting up the trajectory 
laser using the components of the EVI-PAQ kit; demonstrating the ease of
use; and applying the equipment to determine the trajectory of a bullet hole.


Methods 


1. Acquire the Materials 
1.1. The EVI-PAQ Kit mandates the use of the following materials and 
equipment. The kit contains several methods for testing the 
trajectory of a bullet hole; however, the laser will be used for this
experiment. 
1.1.1. Trajectory rod kit with trajectory laser pointer 
1.1.2. Protrusion rods 
1.1.3. Protractor or angle finder 
1.1.4. Reflective card 
1.1.5. Camera with adjustable aperture and exposure 
1.1.5.1. One may require the use of photographic fog if the
area used for the laser is not dark enough or if there are 
not enough small particles off which the laser could 
reflect in midair. 
1.1.6. Tripod 
1.1.7. Wood board prepped with bullet holes 
1.1.7.1. Bullet holes must be wide enough for the protrusion 
rod; .22 caliber bullets may be too small 


2. Photographing the Scene and Bullet Holes Before the Trajectory System 
is Placed 
2.1. Photographs of the scene must be taken before any obstruction 
contaminates the crime scene. 
2.2. Photographs of the bullet holes taken from each side with a scale 
must be acquired. 

2.2.1. Consider photographing the entire affected area if the bullet 
holes are spread throughout on the same surface, then take the 
close-up images with the scale. 

2.2.2. Photographs should be taken from all angles. 

2.2.2.1. Photograph the bullet holes from an angle 
perpendicular to the hole, parallel to the surface on the 
horizontal, and parallel to the surface from above if space
permits. 


3. Preparing the EVI-PAQ Trajectory Laser and Protrusion Rods 
3.1. The laser fastens to the end of the protrusion rod by screwing the 
threaded end piece into the end of the rod. If necessary, one may 
tighten a fastener to the laser and the rod to ensure stability. 
3.2. Place the board with the bullet holes upright so that it is supported 
on a flat surface. 

3.2.1. The angle of the bullet hole in the board should lead the laser 
to a point of origin that is non-reflective to avoid error in the 
calculation of the angle. 

3.3. Set the protrusion rod into the first bullet hole and push through 
until the rod rests on the flat surface and its balance stabilizes. 

3.4. Fasten the protrusion rod to the surface of the board if necessary. 

3.5. Photograph the protrusion rod, after it is stabilized, from several 
angles. 


4. Lighting and Camera Settings 
4.1. The lights in the room containing the laser pointer must be shut off so that
the beam can be seen most clearly once it is turned on. Unless the scene is outdoors
and can be shot at nighttime, the lights must be off.
4.2. The camera should be prepared to take the photographs of the laser. 

4.2.1. Use a long exposure to ensure the light of the laser is shown
clearly.
4.2.1.1. An exposure may last up to three minutes to gather 
the largest amount of light possible for clear photographs. 

4.2.2. The tripod should be placed to capture the entirety of the 
laser’s path. 

4.2.2.1. The laser can project up to 5,000 feet but other 
aspects of trajectory must be considered for any distance 
greater than 50 yards and must be addressed during 
calculations of an origin point. 

4.2.3. The timer on the camera may be set to take the photograph
to avoid any shaking that pressing the capture button may
cause.

4.2.3.1. Two seconds should give ample time for the camera
to steady itself on the tripod after the button is pressed.
5. Turning on the Laser and Documenting Angles

5.1. Turn on the laser.

5.2. Press the capture button on the camera.

5.3. Wait until the exposure stops before moving any aspect of the
scene.

5.3.1. Moving the camera while the shutter is still capturing the 
light will result in a blurred photograph containing a light trail 
of the laser beam and will have to be redone. 

5.4. Use a protractor to measure the angles 

5.4.1. To produce an angle of trajectory, there must be two points 
from which to measure. The entrance and exit can be used to 
measure the angle of impact in thick materials. 

5.4.1.1. The bullet hole may also aid in producing the angle.

5.4.2. The laser will point to the third point necessary for 
determining the point of origin or may pass through the origin 
and continue on if the scene is large enough. 
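
The protractor measurement in step 5.4.1 can also be cross-checked numerically. As a rough sketch, assuming the bullet travels in a straight line through the board, the angle of impact follows from the board thickness and the sideways offset between the entrance and exit holes. The function name and the sample measurements below are hypothetical and are not part of the EVI-PAQ protocol.

```python
import math


def impact_angle_degrees(thickness: float, offset: float) -> float:
    """Angle between the bullet path and the board surface, in degrees.

    thickness: thickness of the board.
    offset: distance along the surface between the centers of the
            entrance and exit holes (same units as thickness).
    A result of 90 means the bullet struck perpendicular to the surface.
    """
    if thickness <= 0:
        raise ValueError("thickness must be positive")
    # Straight-line assumption: the path is the hypotenuse of a right
    # triangle whose legs are the thickness and the offset.
    return math.degrees(math.atan2(thickness, abs(offset)))


# Hypothetical measurements in centimeters.
print(round(impact_angle_degrees(2.0, 2.0), 1))  # 45.0
print(round(impact_angle_degrees(2.0, 0.0), 1))  # 90.0
```

A value computed this way is only as good as the two measurements, so it serves as a check on the protractor reading and does not replace it.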


Figures 1 through 8 illustrate the various steps taken for documentation
and the methods applied to reconstruct the bullet trajectory.


Figure 1 The wood board is photographed from multiple angles with a scale
(white strip) to document the bullet holes. 
Figure 2 The protrusion rod is suspended in the hole with the trajectory laser
attached to the end.


Figure 3 The end of the protrusion rod has a bullet tip mounted and it is placed
into the wood board to begin the angle recreation process. 
Figure 4 The camera is placed to capture the entire length of the laser's path.



Figure 5 The second cluster of holes had three entry and exit holes. 


Figure 6 Bullet hole C was measured from multiple angles.


Figure 7 The third grouping had two bullet holes.


Figure 8 The bullet holes within the third cluster were too small for the
protrusion rod to pass through; their angle of trajectory must be measured
with string and thinner protrusion rods than the ones in the kit.
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Conclusion 


By using a trajectory laser kit, the point of origin is easier to 
visualize, and the angle of trajectory can be measured. In one 
instance, the .22 ammunition hole did not allow the protrusion rod to 
pass through; therefore, using the trajectory laser was not possible for that 
example. Using a protrusion rod connected to the trajectory laser makes it 
easier to measure the angle of trajectory from the wood and allows for clear 
documentation of the angle with a protractor. Although the trajectory laser 
is useful over long distances, it is often difficult to see outdoors or in 
well-lit areas. The stability of the trajectory laser depends on the user holding the 
camera button down, which often results in wobbling as the exposure of the 
camera starts. If the room is dark enough, the exposure of the camera can 
be modified to let the correct amount of light in and still capture the green of 
the laser’s light on a white reflective card that shows the position of the 
laser in the scene. The laser alone cannot show exactly where the shooter 
was standing, because the beam travels from the end of the bullet 
hole to whichever hard surface it next contacts. Crime scene analysts 
can nevertheless estimate the ultimate position of the shooter 
by accounting for all the information gathered in the crime scene 
reconstruction. 


Reusable Models of Manufacturing Processes for 
Discrete, Batch, and Continuous Production 

Abstract 


This article explores the new ASTM E3012-16 International Standard Guide 
for Characterizing Environmental Aspects of Manufacturing Processes, its 
application and potential impact in the manufacturing industry. The standard 
provides guidance for industries to examine unit manufacturing processes, 
capture characteristics in terms of how they impact the environment, and 
explore opportunities to be efficient and sustainable in their operations. The 
standard further encourages formal representations for consistent and effective 
deployment of manufacturing tools and reuse of data and information models 
for automated analysis. 


Introduction 


TO REMAIN COMPETITIVE, manufacturers today seek to improve 
productivity while maintaining quality and meeting sustainability 
objectives. With the manufacturing sector consuming a large percentage of 
our national resources, smart manufacturing and sustainable manufacturing 
implementations through process optimization hold tremendous potential 
for improvement¹,²,³,⁴. Being cognizant of the production improvement 
opportunities is key to success. But where do we start? Starting at the 
process level offers an opportunity to improve process 
performance through the meticulous understanding of selected processes. 


1 Mani, M., Madan, J., Lee, J. H., Lyons, K. W., & Gupta, S. K. (2013). Review on 
Sustainability Characterization for Manufacturing Processes. National Institute of 
Standards and Technology, Gaithersburg, MD, Report No. NISTIR 7913. 

2 Haapala, K.R., Zhao, F., Camelio, J., Sutherland, J.W., Skerlos, S.J., Dornfeld, D.A., 
Jawahir, I.S., Clarens, A.F. and Rickli, J.L., 2013. A review of engineering research in 
sustainable manufacturing. Journal of Manufacturing Science and Engineering, 135(4), 
p.041013. 

3 Stephan Mohr, Ken Somers, Steven Swartz, and Helga Vanthournout, Manufacturing 
resource productivity, McKinsey Quarterly, June 2012. 

4 https://itif.org/publications/2018/11/28/innovation-agenda-deep-decarbonization-bridging-gaps-federal-energy-rdd 


Eventually, these individual opportunities can be harnessed at a systems 
level where multiple manufacturing processes work in concert. 
Characterization of process-level activities can empower better engineering 
at higher levels of manufacturing automation and control. These control 
levels are described in the widely acknowledged enterprise to control 
system hierarchy (ISA 95⁵). Besides this, the ISO 14000 family⁶ of 
environmental management standards is useful for developing a 
management approach to sustainability and retroactively comparing the 
impacts of different comparable products. However, specific guidance for 
manufacturers to characterize individual processes and identify 
opportunities for improvement can be an added advantage. To provide such 
guidance for industries to examine basic manufacturing processes (a.k.a. 
unit manufacturing processes), ASTM International⁷ issued a set of 
standards, including E2979-18⁸, E2986-18⁹, E2987-18¹⁰, E3012-16¹¹, and 
E3096-18¹². These standard guidelines help manufacturers scrutinize and 
capture the characteristics of individual processes in terms of how they 
impact the environment, and look for opportunities to be more sustainable 
in their operations. 


This article specifically explores the new ASTM E3012-16 
International Standard Guide for Characterizing Environmental Aspects of 
Manufacturing Processes¹³ and its consideration for use with discrete, 
batch, and continuous production. The standard provides guidance for 
industries to examine unit manufacturing processes, capture the 


5 https://www.isa.org/isa95/ 

6 https://www.iso.org/iso-14001-environmental-management.html 

7 https://www.astm.org/ 

8 ASTM International (2018). E2979-18: Standard Classification for Discarded Materials 
from Manufacturing Facility and Associated Support Facilities. 

9 ASTM International (2018). E2986-18: Standard Guide for Evaluation of 
Environmental Aspects of Sustainability of Manufacturing Processes. 

10 ASTM International (2018). E2987/E2987M-18: Standard Terminology for 
Sustainable Manufacturing. 

11 ASTM International (2016). E3012-16: Standard Guide for Characterizing 
Environmental Aspects of Manufacturing Processes. 

12 ASTM International (2018). E3096-18: Standard Guide for Definition, Selection, and 
Organization of Key Performance Indicators for Environmental Aspects of 
Manufacturing Processes. 

13 https://www.astm.org/E3012-16.htm 


characteristics of those processes in terms of how they impact the 
environment, and look for opportunities to be more sustainable in their 
operations and improve their efficiency. The standard also encourages 
standard representations for consistent and effective deployment of 
manufacturing tools and reuse of data and information models. 


Current Gaps and Potential for Standards 


Several workshops¹⁴,¹⁵ facilitated by the National Institute of 
Standards and Technology (NIST) across the U.S. have reiterated the 
viewpoint that gaps exist in terms of measurement capabilities to connect 
sustainable manufacturing practices with the promotion of resource 
efficiency. Today’s practices for sustainability-related analysis for products 
do not explicitly account for individual manufacturing processes. Current 
practices fall short in promoting a science-based understanding of 
individual processes critical for their performance improvement and 
decision making¹⁶,¹⁷. Formal methods for collection and consolidation of 
sustainability-related information on manufacturing processes are lacking. 


The measurement science—including methods for process 
description, performance metrics, and a corresponding information base for 
unit manufacturing processes—will allow for a more consistent evaluation 
of sustainability performance across manufacturing systems. Providing the 
science in the form of best practices is a goal for the ASTM International 
standard. 


14 M. M. Smullin; K. R. Haapala; M. Mani; K.C. Morris. ‘Using industry focus groups 
review to identify challenges in sustainable assessment theory and practice.’ ASME 
International Design and Engineering Technical Conferences & Computers and 
Information in Engineering Conference, Charlotte 2016. 

15 W.Z. Bernstein et al., 2018. ‘Research directions for an open unit manufacturing 
process repository: A collaborative vision,’ Manufacturing Letters, 15 (B), pp.71-75. 

16 Mani, M., Madan, J., Lee, J. H., Lyons, K. W., & Gupta, S. K. (2014). Sustainability 
characterization for manufacturing processes. International Journal of Production Research, 
52(20), 5895-5912. 

17 Duflou, J.R., Sutherland, J.W., Dornfeld, D., Herrmann, C., Jeswiet, J., Kara, S., Hauschild, M. 
and Kellens, K., 2012. Towards energy and resource efficient manufacturing: A processes and 
systems approach. CIRP Annals-Manufacturing Technology, 61(2), pp.587-609. 


ASTM International Standards on Sustainability 


ASTM International is a global leader in the development of 
voluntary consensus standards. ASTM International formed the E60.13 
Subcommittee on Sustainable Manufacturing to guide industry in best 
practices to inform sustainability-related decisions. More information on 
the standards published through this committee can be accessed from the 
committee website¹⁸. The E60.13 E3012-16 standard defines a 
methodology to develop unit manufacturing process (UMP) information 
models. The standard contributes to the measurement science needed to 
quantify sustainable manufacturing practices to the benefit of industrial 
competitiveness. Standard methods for describing the environmental 
choices that a manufacturer makes allow them to improve their practices 
and to differentiate themselves from the competition. 


Application of the standard benefits manufacturing practices in two 
ways. First, it raises consciousness about manufacturing processes, their 
environmental impacts, and opportunities for their improvement. The goal 
of applying the standard is to improve the environmental aspects of the 
process through the definition of key performance indicators specific to an 
individual process addressing potential enterprise level goals. Establishing 
that rigor sets the stage for better informed decision-making and production 
planning. Second, the use of standard practices and formal representation 
methods poises manufacturers for transition into scientific modeling and 
decision making and, as a result, leads to more sustainable systems. 


The new ASTM standard provides guidance to help manufacturers 
effectively understand processes, capture process characteristics in terms of 
environmental impact, and identify opportunities for improvement. 
Characteristics of a process imply descriptions of what goes into and out of 
the process, how the process transforms its inputs to outputs, and what types 
of information are used in the transformation. The standard format defined in 
ASTM E3012-16 provides a basis for ensuring that a specific set of details are 
defined and that they are covered in a consistent manner. See Figure 1. In this way, the 
standard offers a method to generate reusable constructs (UMP information 
models) that provide a structured way of both understanding and specifying 


18 https://www.astm.org/COMMIT/SUBCOMMIT/E6013.htm 


unit manufacturing processes. Such constructs presented in an abstract and 
precise manner can be parameterized and reused in different application 
contexts like information processing, simulation, and analysis. The standard 
makes for better comparisons, increased reuse, and, in the end, more reliable 
results. 
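
To make the idea of a parameterized, reusable construct concrete, the following minimal sketch shows one way such a model could be held in software. It is an illustration only and does not reproduce the schema defined in ASTM E3012-16: the class name, field names, the material-removal transformation, and all numeric values are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Flows = Dict[str, float]


@dataclass
class UnitProcessModel:
    """Hypothetical container for one unit manufacturing process (UMP)."""
    name: str
    parameters: Flows                            # process information used by the transformation
    transform: Callable[[Flows, Flows], Flows]   # maps (inputs, parameters) to outputs

    def run(self, inputs: Flows) -> Flows:
        return self.transform(inputs, self.parameters)


def material_removal(inputs: Flows, params: Flows) -> Flows:
    """Illustrative transformation: remove a fraction of the incoming material."""
    removed = inputs["material_kg"] * params["removal_fraction"]
    return {
        "product_kg": inputs["material_kg"] - removed,
        "waste_kg": removed,
        "energy_kwh": removed * params["kwh_per_kg_removed"],
    }


# The same transformation is reused with two different parameter sets.
rough_cut = UnitProcessModel(
    "rough cut", {"removal_fraction": 0.25, "kwh_per_kg_removed": 2.0}, material_removal
)
finish_cut = UnitProcessModel(
    "finish cut", {"removal_fraction": 0.125, "kwh_per_kg_removed": 4.0}, material_removal
)

print(rough_cut.run({"material_kg": 8.0}))
print(finish_cut.run({"material_kg": 8.0}))
```

Keeping the transformation separate from its parameters is the design choice that allows one description to be reused across application contexts such as simulation or analysis.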


[Figure 1 image: panels labeled “Physical World” and “Digital World”] 

Figure 1. Overview of the significance and use of this standard. UMPs store digital 
representations of physical manufacturing assets and systems to enable engineering 
analysis, e.g., optimization, simulation, and life cycle assessments. 


Potential Impact 


ASTM E3012-16 is a good starting point for creating reusable 
descriptions of manufacturing processes that will ultimately realize process 
analytics and tool integration. In addition to systematic characterizations of 
processes, the formal representations for those characterizations support the 
direct use of the information within a variety of applications. The most basic 
application is to support effective communication by ensuring consistency 
and completeness. More advanced applications include computational 
analytics and comparison of performance information. The formal 
information model described in the guide facilitates new software tool 
development to link manufacturing information and analytics for 
calculating environmental performance measures. Further, the standard 
format paves the way for more specific software tools supporting the 
development and extension of standardized data and information bases such 
as Life Cycle Inventory (LCI). LCI data are extensively used in life cycle 
assessments (LCA), part of the ISO 14000 family of standards. The top-down 
approach of the ISO 14000 family and the bottom-up measurement 
approach of the ASTM standards are complementary. 


Formally defined UMP models can serve the information needs of 
different users from a variety of perspectives. For example, using the standard, 


• a variety of stakeholders, e.g., plant managers, process engineers, 
technicians and operators, can better understand and communicate 
manufacturing processes through consistent and tailored views of 
the model; 

• manufacturing engineers can develop system models from the unit 
manufacturing processes by linking them together to characterize 
specific production plans for discrete, batch, or continuous 
production; 

• systems integrators can use models of manufacturing processes to 
understand material and information flows; and 

• manufacturers can capture their own data for LCA-based 
environmental assessments by developing data sets representing the 
environmental impacts of their unit processes, complementing and 
sharpening LCI data sets. 
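
The second use above, linking unit process models into a system model, can be sketched as follows. This is a simplified illustration under stated assumptions, not a method prescribed by the standard: each process is reduced to a hypothetical function of incoming mass, and the yield and energy figures are invented for the example.

```python
from typing import Callable, List, Tuple

# A process maps incoming mass (kg) to (outgoing mass in kg, energy used in kWh).
Process = Callable[[float], Tuple[float, float]]


def make_process(yield_fraction: float, kwh_per_kg_in: float) -> Process:
    """Build a hypothetical unit process from two illustrative parameters."""
    def process(mass_in: float) -> Tuple[float, float]:
        return mass_in * yield_fraction, mass_in * kwh_per_kg_in
    return process


def run_line(processes: List[Process], mass_in: float) -> Tuple[float, float]:
    """Chain processes so each output feeds the next; sum energy for the line."""
    mass = mass_in
    total_energy = 0.0
    for process in processes:
        mass, energy = process(mass)
        total_energy += energy
    return mass, total_energy


# Two linked processes form a small production line.
line = [make_process(0.5, 1.0), make_process(0.5, 2.0)]
final_mass, line_energy = run_line(line, 8.0)
print(final_mass, line_energy)
```

Because the line-level energy is simply accumulated from the unit processes, an improvement to any single process shows up directly in the system-level indicator.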


In related work, the authors explored the use of the standard with 
three use cases in the pulp and paper industry. The case studies showed the 
utility of the draft standard as a guideline for composing data to characterize 
manufacturing processes. The data, besides being useful for descriptive 
purposes, was used in a simulation model to assess sustainability of the 
manufacturing system.¹⁹,²⁰ 


Scope of the Current Standard and Beyond 


Leveraging unit process models is by no means a new idea to 
continuous process industries, such as the chemical industry. For nearly a 
century, mathematical representations of “unit operations,” such as 


19 Mani, M., Larborn, J., Johansson, B., Lyons, K., & Morris, KC. (2016). Standard 
representations for sustainability characterization of industrial processes. Journal of 
Manufacturing Science and Engineering. 

20 Rebouillat, L., Barletta, I., Johansson, B., Mani, M., Bernstein, W.Z., Morris, K.C. and 
Lyons, K.W., 2016. Understanding sustainability data through unit manufacturing 
process representations: a case study on stone production. Procedia CIRP, 57, pp.686-691. 


filtration, evaporation, humidification and distillation, have been derived 
for controlling both small-scale plants and industrial installations.²¹ 
Considering the longevity of the unit process-based approaches in chemical 
engineering²², the authors envision its direct relevance to the process 
industry and beyond. The hope is that the formal characterization of 
UMPs across diverse industries will enhance existing analysis 
frameworks, for example by improving the precision of life cycle assessment, a 
method that is still burdened with significant uncertainty.²³ ASTM E3012-16 
is designed to be relevant across different production types, including 
discrete, batch, and continuous. The standard provides a fundamental 
representation to support unit manufacturing processes in all of these 
production settings. Characterizing the bounds of each unit manufacturing 
process drives insight into each process’s functional characteristics. 


The current standard is a first step to facilitate studies of existing 
processes and to make those studies more accessible in the future. It can 
serve as the basis for the development of production system models to better 
understand process flows and interactions between and across different 
processes. A repository of UMP models can be used for planning both the 
retrofit of existing facilities and the design of new facilities. Designs for 
new facilities are almost always based on prior experience with operating 
processes, and realistic models should prove useful, especially for 
verification and validation activities. 


The perceived scientific benefits to manufacturers from application 
of the standard include reduced operational costs, improved prediction of 
product costs, improved schedule, maximization of manufacturing 
resources, improved control of product quality, and incorporation of best 
practices. Modeling individual manufacturing processes facilitates the 
generation of quantifiable evidence that improvements are being made. The 
standard provides a uniform and repeatable way for more practitioners to 
reap these benefits. 


21 Walker, W.H., Lewis, W.K. and McAdams, W.H., 1923. Principles of chemical 
engineering. London: McGraw-Hill Publishing Co 

22 Turton, R., Bailie, R.C., Whiting, W.B. and Shaeiwitz, J.A., 2008. Analysis, synthesis 
and design of chemical processes. Pearson Education. 

23 Jacquemin, L., Pontalier, P.Y. and Sablayrolles, C., 2012. Life cycle assessment (LCA) 
applied to the process industry: a review. The International Journal of Life Cycle 
Assessment, 17(8), pp. 1028-1041. 


The standard will be of interest to software providers across 
industries interested in providing analysis and modeling/simulation 
solutions to manufacturers. The standard format promotes information 
exchange and communication through digitalization of manufacturing 
assets for decision making purposes. Moving forward, with contribution 
from industries, future standards can encompass a broader set of processes 
and functionalities using ASTM E3012-16 as a platform on which to build. 
Further, the creation of a repository of models should reduce modeling time 
and improve model verification and validation activities. The creation of a 
repository of models also provides a forum for industries to come up with 
best practices and target sets of UMP models for common processes as 
reference data.²⁴ 


Future Work and Conclusions 


Because the standard is relatively new and was co-developed by supportive 
manufacturers, the ASTM task group is now seeking more participation 
from across industries, especially small and medium-sized enterprises 
(SMEs), to demonstrate and further improve the standards. The standard has 
already received some attention, and efforts are underway to spread the word. 
Much of the vision for the work will require further research and future 
standards based on real-world experience²⁵. UMP-focused industrial case 
studies are of interest to the task group. NIST has already hosted two 
competitions, and will host a third, to apply the standard to existing process 
models.²⁶ These competitions resulted in a diverse set of models and focused 
attention within the educational world. To realize the promise of reusing 
such models and automating analytics and system integration for 
manufacturing, significant research challenges remain, including 
advancements in the following areas: 


• Knowledge and understanding of UMP modeling. This includes 
novel formal representations and methodologies, more accurate or 
specialized metrics, metric representations that support cascading to 
higher production levels, or exploration of variations for families of 
UMP models. 

• Standards supporting model reuse. This includes automated 
methods that allow linking of UMP models into systems, facilitating 
system composition through naming conventions or other methods, 
generalization that unifies a collection of processes, or standards-based 
methods for integration with applications. 

• Techniques for development and validation of UMP models. This 
includes demonstration of validation techniques for the effectiveness 
and accuracy of the UMP models, techniques for producing useful 
derivatives of UMP models, or creative methods for mining 
documentary model descriptions into formal representations. 


24 W. Z. Bernstein; M. Mani; K. W. Lyons; K.C. Morris; B. Johansson. ‘An Open 
Web-Based Repository for Capturing Manufacturing Process Information.’ ASME 
International Design and Engineering Technical Conferences & Computers and 
Information in Engineering Conference, Charlotte 2016. 

25 W.Z. Bernstein et al., 2018. ‘Research directions for an open unit manufacturing 
process repository: A collaborative vision,’ Manufacturing Letters, 15 (B), pp.71-75. 

26 https://www.nist.gov/news-events/events/2018/01/ramp-reusable-abstractions-manufacturing-processes 


As more groups apply the standard in their domains, the shared 
experience will provide a basis on which to further understand 
standardization needs and opportunities. Formal methods for acquiring and 
exchanging information about manufacturing processes will lead to 
consistent characterizations and help establish a collection of reusable 
models. Standardized methods will ensure effective communication of 
computational analytics and sharing of sustainability performance data. 
NIST is also looking for manufacturers to collaborate on pre-pilot projects 
to contribute to the collection of use cases for the standard. In conclusion, 
the use of a reusable standard format should result in models suitable for 
automated inclusion in a system analysis, such as a system simulation model 
or an optimization program. 




Bios 


Mahesh Mani is a Senior Technology Adviser with Allegheny Science and 
Technology supporting the Advanced Manufacturing Office of the 
Department of Energy. His research interests include smart, sustainable and 
additive manufacturing. 


KC Morris leads a group at the National Institute of Standards and 
Technology focused on standards to infuse smart technologies into the 
manufacturing sector while ensuring that new practices lead to more 
competitive and sustainable manufacturing. Currently, KC is on detail to 
the US House of Representatives serving as an ASME Manufacturing 
Fellow. 


Kevin W. Lyons recently retired from the National Institute of Standards 
and Technology. His research interests include sustainable manufacturing, 
nano manufacturing, design, process modeling, assembly, virtual assembly, 
and additive manufacturing technologies. 


William Z. Bernstein is a research engineer at the National Institute of 
Standards and Technology. Dr. Bernstein currently leads the Product 
Lifecycle Data Exploration and Visualization project. His research interests 
include advanced visualization, information modeling, and sustainable 
manufacturing. 




Generating Domain Terminologies using Root- and 
Rule-Based Terms¹ 


Jacob Collard¹, T. N. Bhat², Eswaran Subrahmanian³,⁴, Ram D. Sriram³, 
John T. Elliot², Ursula R. Kattner², Carelyn E. Campbell², Ira Monarch⁴ 


Independent Consultant, Ithaca, New York¹, 

Materials Measurement Laboratory, National Institute of Standards and Technology, 
Gaithersburg, MD², 

Information Technology Laboratory, National Institute of Standards and Technology, 
Gaithersburg, MD³, 

Carnegie Mellon University, Pittsburgh, PA⁴, 

Independent Consultant, Pittsburgh, PA⁵ 


Abstract 


Motivated by the need for flexible, intuitive, reusable, and normalized 
terminology for guiding search and building ontologies, we present a general 
approach for generating sets of such terminologies from natural language 
documents. The terms that this approach generates are root- and rule-based terms, 
generated by a series of rules designed to be flexible, to evolve, and, perhaps most 
important, to protect against ambiguity and standardize semantically similar but 
syntactically distinct phrases to a normal form. This approach combines several 
linguistic and computational methods that can be automated with the help of 
training sets to quickly and consistently extract normalized terms. We discuss how 
this can be extended as natural language technologies improve, and how the 
strategy applies to common use-cases such as search, document entry and 
archiving, and identifying, tracking, and predicting scientific and technological 
trends. 


1. Introduction 


1.1 Terminologies and Semantic Technologies 


SERVICES AND APPLICATIONS ON THE WORLD-WIDE WEB, as well as 
standards defined by the World Wide Web Consortium (W3C), the primary 
standards organization for the web, have been integrating semantic 
technologies into the Internet since 2001 (Koivunen and Miller 2001). 
These technologies have the goal of improving the interoperability of 


1 Commercial products are identified in this article to adequately specify the material. 
This does not imply recommendation or endorsement by the National Institute of 
Standards and Technology, nor does it imply the materials identified are necessarily the 
best available for the purpose. 
Special thanks to Sarala Padi for assistance in compiling and presenting this document. 


applications on the rapidly growing Internet and creating a comprehensive 
network of data that goes beyond the unstructured documents that made up 
previous generations of the web (Berners-Lee, Hendler, and Lassila 2001; 
Feigenbaum et al. 2007; Swartz 2013). These semantic technologies can be 
used to protect against ambiguity and reduce semantically similar but 
syntactically distinct phrases to normalized forms. Semantically normalized 
forms allow users to more easily interact with data and developers to reuse 
data across applications. However, most of these technologies rely on data 
that have been annotated with semantic information. Data that were not 
designed for use on the semantic web most likely do not include this 
information and are therefore more difficult to integrate. For example, 
scientific research papers are typically text- and graphics-based documents 
designed to be read and processed by humans. These documents do not 
usually contain semantic markup, meaning that search engines may not be 
able to take advantage of such advances in data technologies. 


Another major issue in semantic computing is the representation of 
domain-specific semantics. Different scientific and academic disciplines, as 
well as other spheres of communication such as conversation, business 
interactions, and literature, all have overlapping vocabulary. However, the 
same words often have different meanings depending on the domain. In the 
sciences, each field (and often each subfield) has its own terminology that 
is not used in other disciplines, or that conflicts with the language of more 
general-purpose communication. For example, in general use, the word 
fluid typically includes liquids, but not gases. However, in physics, gas and 
liquid are both hyponyms of fluid. Semantic technologies may assume that 
two annotations with the same name have the same semantics, when this is 
not necessarily the case. 
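
The fluid example can be expressed as a small data structure in which hyponym relations are stored per domain, so that the same word can behave differently depending on context. This is a sketch for illustration only; the dictionary layout, the domain labels, and the function name are assumptions and are not taken from any particular semantic web technology.

```python
# Hypothetical per-domain relations: each term maps to the broader terms
# (hypernyms) it falls under within that domain.
HYPERNYMS = {
    "general": {"liquid": {"fluid"}},
    "physics": {"liquid": {"fluid"}, "gas": {"fluid"}},
}


def is_hyponym(term: str, broader: str, domain: str) -> bool:
    """Return True if `term` is recorded as a kind of `broader` in `domain`."""
    return broader in HYPERNYMS.get(domain, {}).get(term, set())


for domain in ("general", "physics"):
    print(domain, "gas is a fluid:", is_hyponym("gas", "fluid", domain))
```

A system that ignored the domain key and merged both tables would silently assert that gas is a fluid in general usage, which is the kind of mismatch described above.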


Generally speaking, this issue stems from the problem of 
coordination described by Clark and Wilkes-Gibbs (1986). Any participant 
in a system of communication is typically missing some of the knowledge 
held by other participants. Because of this knowledge gap, participants may 
not understand one another unless they are using a shared knowledge system 
and a shared vocabulary. Clark and Wilkes-Gibbs (1986) describe how 
speakers establish a common ground, collaborating to ensure that 
participants know one another’s strengths and limitations. Interactions 
involving computers are also systems of communication, and must also 
coordinate in order to ensure that all applications are communicating 
properly. Establishing common ground across fields is supported by 
standardizing terminology and having hyponymic and other semantic 
relations structured for use by humans and machines. 


Issues of coordination are relevant in many fields, particularly when 
it comes to data re-use and interoperability. For example, The Minerals, 
Metals, & Materials Society (TMS) describes many gaps and limitations in 
current materials science standards; one of these gaps is an “insufficient 
number of open data repositories,” referring to repositories containing data 
that can be used by many applications, with the stipulation that data not only 
be available, but also be re-usable (“Modeling Across Scales” 2015). This 
is impossible without some means of coordination and standardization of 
terminology. TMS recommends developing initiatives to aid in 
coordination, for example by engaging “a multidisciplinary group of 
researchers to define terminology and build bridges across disciplines.” 
Multidisciplinary coordination is a necessary part of improving data 
reusability, but support of such coordination is lacking. 


Our goal in this paper is to describe a general system that is capable 
of automatically creating standardized terminologies that will be useful for 
developing domain ontologies and other data structures to fill this gap. A 
domain ontology is a collection of concepts (represented as terms) and 
relations between them that correspond to knowledge about a particular 
family of topics. Our system will take into account potential issues in 
terminology generation, including the disambiguation and normalization of 
terms in a domain. In many cases, a single term can be expressed in many 
different ways in natural language. For example, in mathematics the phrase 
“without loss of generality” has a technical meaning; however, a researcher 
might also write “without any loss of generality” or “without losing 
generality.” All three variants refer to the same technical phrase and should 
be treated together in a standard terminology. Terminologies also face 
issues relating to polysemy, syntactic ambiguity, context sensitivity, and 
noisy data. 
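
As a rough illustration of what normalizing such variants involves, the toy sketch below maps the three phrasings to one form. It is not the root- and rule-based method described in this paper: the function-word list and the table of base forms are hand-written assumptions that cover only this example.

```python
import re

# Hypothetical, hand-written resources for this illustration only.
FUNCTION_WORDS = {"a", "an", "the", "any", "of"}
BASE_FORMS = {"losing": "loss", "lose": "loss", "losses": "loss"}


def normalize(phrase: str) -> tuple:
    """Lowercase, drop function words, and map word forms to a base form."""
    tokens = re.findall(r"[a-z]+", phrase.lower())
    return tuple(BASE_FORMS.get(t, t) for t in tokens if t not in FUNCTION_WORDS)


variants = [
    "without loss of generality",
    "without any loss of generality",
    "without losing generality",
]
for v in variants:
    print(v, "->", normalize(v))
```

All three variants reduce to the same tuple, so they can be counted and indexed as one term. A hand-written table of this kind does not scale and cannot resolve polysemy or syntactic ambiguity, which is part of the motivation for a more principled, syntax-aware approach.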


The key component of our system is a representation of terminology 
that takes advantage of the compositional nature of natural language 
semantics by converting natural language phrases into consistently 
structured terms. This representation overcomes issues of syntactic 
variation by normalizing different syntactic structures based on their 
compositional semantics. That is, we represent synonymous phrases in the 
same way despite differences in surface realization. To help ensure 
consistent terminology generation, our system uses a set of rules to restrict 
and guide the formation of these normalized terms. Because this system is 
based on a modular set of rules that combine smaller components of phrases 
into standardized terms, we refer to it as a root- and rule-based method 
for terminology generation. 


Our root- and rule-based approach is motivated more by linguistic 
than statistical models. This is necessary for the construction of rule-based 
terms, which are dependent on consistent structures and a representation of 
the underlying linguistic form. Our rule-based model ultimately relies on 
the way that words come together syntactically in order to form phrases. 
The meanings of these phrases are, in most cases, related to the meanings 
of their component parts (i.e. the individual words). The way that words 
compose to form more complex meanings is detailed in research such as 
Montague (1988), though the underlying principles ultimately extend back 
to Frege (1884). Through an understanding of syntax as modeled in 
linguistics, it is possible to formalize the compositionality and therefore 
normalize synonymous phrases, despite significant differences in form. 


[Figure 1. Terminology Generation. Diagram components: Key Phrases, Key Phrase Extractor, Tethered Root Generator, Super Root Generator, Term Generator.]


Within the context of our root- and rule-based system, a term is a
representation of a concept within a domain (and may cover a number of
words and/or phrases); a collection of terms describing the same domain
makes up a terminology. Our system also defines roots, which are smaller
components that come together to make up terms; that is, a single term
is made up of one or more roots. A terminology is distinct from an ontology
in that an ontology additionally defines the relationships between concepts, 
though the concepts in an ontology are typically represented by terms of 
some sort. A rule is a codified process used to generate, restrict, or 
normalize terms in our root- and rule-based approach. This paper will 
describe the linguistic and theoretical motivations behind these rules, which
are introduced in Bhat et al. (2015) as specifically applied to materials
science. In Section 2, we describe the theory behind the syntactic
normalization that allows for the automatic construction of root- and
rule-based terms. This is followed by Section 3, where we describe how to use
features of natural language syntax in combination with additional rules to 
create root- and rule-based terms. We then describe how terms can be 
extracted from natural language texts in Section 4. One of the main features 
of the root- and rule-based approach is that it is easily extensible and 
adaptable to new and different use-cases, as we discuss in Section 5. Lastly, 
we describe how our system ties in with the challenge described above of 
creating terminologies that are robust to the complexities of natural 
language and to the needs of users in Section 6. 


1.2 Use-Cases and Architecture 


In developing this strategy, we have considered four very general 
use-cases, each of which corresponds with an interface that allows users and 
administrators to interact with the terminology generation system. These 
will be discussed in greater detail in Section 6. We also discuss ontology 
generation as an extension of terminologies generated from the root- and 
rule-based method. 


• Document Entry: Users should be able to upload documents, have
terms extracted from these documents, make changes to the suggested
terms, and have the terms added to the terminology in the database.


• Document Retrieval: Users should be able to construct a query and
receive a list of documents matching the terms in the query.


• Curation: Curators (who may be dedicated administrators or user
volunteers in crowd-sourced systems) should be able to make changes
to the terminology.


• Rule Changes: Curators should be able to make changes to the way
the system generates terms, as technologies change. Changes also
reflect the way that people are using the terminology and systems that
have become de facto standards.


• Ontology Generation: Users should be able to use a set of terms as
the basis for a domain ontology, which extends a terminology by
providing additional semantic relationships between terms.




With these in mind, we have outlined a root- and rule-based terminology 
system in Figure 1. This figure shows the general process which will be 
explained in this paper. Through this process, a set of rules is used to
extract a series of key phrases from a corpus and convert them into root- 
based terms (three types of roots are described in this paper: roots, tethered 
roots, and super-roots; see Section 3). 


2. Theory 


Generating a terminology requires identifying salient concepts in a 
corpus and constructing representations of those concepts. There are many 
different approaches to identifying salient concepts, that is, to finding words
and/or phrases in the text that stand out. In many cases, the identification of 
salient concepts produces a list of words and phrases taken directly from the 
text. A terminology generation system then needs to convert words and 
phrases into a format which enhances the potential for humans and 
machines to use the terminology for various practical applications. As with 
key phrase extraction, researchers and developers have used a variety of 
methods to convert words and phrases into terms representative of key 
concepts (Witschel 2005). 


2.1 Previous Research 


Key phrase selection is a major area of study in the field of 
information extraction (Witschel 2005). Many methods of key phrase 
extraction rely on two components: a unithood metric and a termhood 
metric. A unithood metric determines the particular types of words and 
phrases that qualify as key phrase candidates. Units found by a unithood 
metric may or may not be relevant enough to qualify as key phrases, but can 
be used to restrict the set of words that must be compared for relevance. For 
example, a unithood metric may identify all noun phrases in a corpus, so 
that only noun phrases are considered as potential terms. Not all of the noun 
phrases selected will become terms, but only noun phrases will be extracted. 
More complex unithood metrics are also possible. Frantzi, Ananiadou, and
Tsujii (1998), for example, consider the following unithood metrics in the
evaluation of their C-Value and NC-Value algorithms, which are algorithms
for key phrase extraction¹:


• Noun+ Noun
• (Adj|Noun)+ Noun
• ((Adj|Noun)+|((Adj|Noun)*(Noun Prep)?)(Adj|Noun)*)Noun


¹ In this representation Noun, Adj, and Prep are patterns matching parts of speech (noun,
adjective, and preposition, respectively). Parentheses group patterns together, and two
patterns separated by a pipe (|) produce a new pattern which matches either of its
components. The asterisk (*) produces a pattern that matches the previous expression
zero or more times. The plus sign (+) is similar, but matches the previous expression
one or more times, while the question mark (?) matches the previous expression zero or
one times.


A typical automatic term recognition algorithm may then identify
which of the selected units are relevant using a termhood metric. A
termhood metric may be based on statistical or linguistic features; Frantzi,
Ananiadou, and Tsujii (1998) use word frequency to identify nested terms
(candidates which occur within other candidates) and the surrounding
context of a term. These are used together with a mathematical formula to
assign a score to each candidate. In this way, they are able to select for
particular types of terms. The features used by Frantzi, Ananiadou, and
Tsujii (1998) are by no means exhaustive; Proux et al. (1998) and
Rindflesch, Hunter, and Aronson (1999), for example, make use of
linguistic information such as part-of-speech tags to improve their termhood
metric.


We do not present any new methods for automatic term recognition,
nor do we make any judgment as to the “best” contemporary method.
However, because the model of term generation and normalization that we
describe is dependent on automatic term recognition, we do discuss it
briefly. In theory, our system can be used with any automatic term
recognition algorithm, though multi-word terms, such as those recognized
by the C/NC-Value Algorithm (Frantzi, Ananiadou, and Tsujii 1998), are
optimal for the root- and rule-based method, as hierarchical relationships can
be inferred from these complex terms.


Once terms have been selected, we generate normalized concepts
rather than using natural language phrases. Natural language phrases have
many disadvantages, including ambiguity and synonymy, as discussed
previously. One method of normalizing concepts is to automatically group
words into clusters, as in Liu et al. (2012). For example, the phrases
“monthly expense,” “personal insurance product,” “core product,”
“voluntary benefit,” and “personal insurance” may all be clustered to form
a single concept representation related to insurance. In this way, word
choice among different authors and contexts is normalized: words whose
appearance is positively correlated are grouped together. The disadvantage
of clusters is that they are difficult to label: other than appearing as the set
of words in a given cluster, they are not human-readable.


Other approaches to normalization involve various other statistical
and natural language processing techniques. Park, Byrd, and Boguraev
(2002), for example, use a combination of stop word removal,
lemmatization (normalization of different forms of a word, e.g., people and
person or colors and color), and abbreviation detection. There are various
components of a key phrase that may not be desirable in terminology
generation. Inflectional morphology (grammatical affixes such as -s and
-ing, which provide grammatical information within the context of a natural
language sentence) does not usually differentiate between technical terms.
A terminology should not usually extract both heat capacity and heat 
capacities — these are probably both instances of the same term. Certain 
functional words, including articles such as a, an, and the may also be 
unhelpful. These issues can be dealt with through lemmatization and stop 
word removal, both of which are well-known problems with many proposed 
solutions in the field of natural language processing (Park, Byrd, and 
Boguraev 2002). However, even assuming that we can lemmatize phrases 
and remove stop words, there may still be undesirable redundancies in an 
automatically generated terminology, as there are many ways to express the 
same thing by using different syntactic structures. 
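

The effect of lemmatization and stop word removal, and the redundancy that remains afterwards, can be illustrated with a minimal Python sketch. The stop word list, the irregular-form table, and the suffix rules below are toy assumptions made for illustration; they are not the methods of Park, Byrd, and Boguraev (2002), and a real system would use an established lemmatizer.

```python
from typing import List

# Toy resources: illustrative assumptions, not a real lexicon.
STOP_WORDS = {"a", "an", "the"}
IRREGULAR = {"leaves": "leaf", "people": "person"}


def lemmatize(word: str) -> str:
    """Very naive lemmatizer: irregular lookup, then plural stripping."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"  # capacities -> capacity
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]  # colors -> color
    return word


def normalize_phrase(phrase: str) -> List[str]:
    """Drop stop words, then lemmatize the remaining words in order."""
    return [lemmatize(w) for w in phrase.split() if w.lower() not in STOP_WORDS]
```

With these toy rules, "heat capacity" and "the heat capacities" both reduce to the same word list, whereas "the red tree leaves" and "the red leaves of the tree" still reduce to different lists. The second pair is the kind of residual redundancy that syntactic normalization is meant to address.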


Consider, for example, the syntactically different phrases the red tree
leaves and the red leaves of the tree. In many contexts, these phrases have
approximately the same meaning: they both refer to leaves which belong to
or grow on a tree and are red in color. When dealing with phrases such as
these, a terminology extraction algorithm should be able to reduce 
redundancy by converting both of these terms into a normalized syntactic 
pattern. A great deal of theoretical and applied linguistic and computational 
research has been done regarding the determination of a sentence or 
phrase’s syntactic structure. By applying dependency parsing to 
terminology extraction, it becomes possible to normalize the syntactic 
structure of phrases. While many other terminology generation algorithms 
work with phrases, one of the main benefits of our approach is the ability to 
normalize surface-level differences in the structure of these phrases. 


2.2 Dependency Grammar 


As mentioned above, we use dependency parsing to normalize the 
syntactic structure of phrases. A dependency parser is a tool for syntactic 
analysis that produces a dependency tree, which is defined in terms of the 
relationship between a phrasal head and the rest of the phrase (Tesnière
1959). The phrasal head carries the syntactic category of the phrase; e.g.,
a noun phrase is headed by a noun, a verb phrase by a verb, etc.
Furthermore, a dependency represents some semantic information, as the
head of a phrase is typically specified by the rest of the phrase. For example,
the phrase the tree’s red leaves (a noun phrase) is headed by the word
leaves. The word leaves on its own is quite general: it could refer to the
leaves of any plant whatsoever. However, because of the rest of the phrase, 
we know that the leaves in question are red and that they belong to or come 
from a tree. 


Dependency syntax can be represented using a tree structure such as 
that in Figure 2. Modern parsing technologies, such as the Stanford Parser 
(Manning et al. 2014) and MaltParser (Hall 2006) are capable of 
automatically generating dependency trees from text. With the help of these 
tools we can take advantage of the semantic information provided by 
syntactic structures and normalize the syntax of phrases to generate terms. 


[Figure 2: Dependency representation of the tree’s red leaves. The head leaves has two dependents, tree’s and red; the is a dependent of tree’s.]


3. Representing Root- and Rule-Based Terms


3.1 Guidelines for Terminology Representation 


Because the goal of providing a domain terminology is to produce a 
list of formal concepts in the domain, it is necessary that generated terms be 
both unambiguous and relevant to the domain. If terms are ambiguous, then 
the terminology will be inaccurate. If terms are irrelevant, even if their 
semantics are correctly represented, the terminology will not be of any 
practical use. We have defined the following criteria to describe useful
terms for domain terminologies, based on rules 1 to 10 in Bhat et al. (2015)
(reproduced in Section 7). These criteria should generally apply to any
terminology generation schema.

• Terms should be human-readable and machine-friendly. All terms
should be based on natural language.

• The same term representation should always identify the same
concept within a terminology; similarly, two terms with different
representations should represent different concepts (see Bhat et al.
(2015); rules 5, 7, and 8).

• The meaning and form of a term should be predictable from smaller
parts. This predictability must be applicable to both humans and
machines (rules 8, 9, and 10).

• The form of a term should be predictable; that is, given a particular
meaning, it should be possible to derive a compositional name for a
term with that meaning (rules 2, 3, and 6).

• Given a term’s compositionality, both humans and machines should
be able to identify semantic relationships.

• Terms should be intuitive enough that both humans and machines can
identify existing semantic relationships between them.

• Terms should be mutable enough that new terms with related
semantics can be generated (rules 3, 4, and 8).

• Only terms representing discriminating concepts should be generated
(rules 2 and 8).

• Terms representing highly specific instances and individuals should
generally not be generated; terms should be reusable in many
use-cases (rule 9).


Generally speaking, terms will represent a hierarchy, with some 
concepts being more specific than others. Thus, there is no concrete 
definition for “too general” or “too specific” as applied to a terminology; 
we are simply looking for terms that are useful at the level of specification 
provided by the domain, keeping in mind that terms should be usable for 
data representation, sharing, and analysis. The domain terminology should 
be representative of the input corpus. 


3.3 Normalized Dependency Trees 


The term representation that we have developed takes advantage of
common features of natural language to create a human-readable schema
that maintains the stipulations given above (though building more complex
structures, such as ontologies, is also dependent on other components, such
as the term selection strategies discussed in Section 4 and Section 8).


The theoretical framework for our term representation is the
syntactic structure of phrases and dependency grammar, as discussed in 
subsection 2.2. Though dependency trees are not easily human-readable, 
they can be converted into human-readable forms, due to one important 
feature: given a dependency parse, it is not the order of the nodes which 
determines the meaning of the phrase — all isomorphic trees have the same 
meaning. The hierarchical structure of a dependency tree represents all of 
the semantic information that our strategy relies on; the linear order of the 
daughter nodes represents only the surface ordering of morphemes and does 
not on its own contain any semantic information. That is, the trees shown in 
Figure 3 and Figure 4 are semantically very similar, despite differences in 
node order. Automatic dependency parsers, such as MaltParser (Hall 2006), 
will produce similar trees, though we have filtered function words from 
these in order to improve normalization. 


[Figure 3: Collapsed representation of the leaves of the tree that are red. The head leaf has two dependents, in the order tree, red.]


[Figure 4: Collapsed representation of the red leaves of the tree. The head leaf has two dependents, in the order red, tree.]


Because of this fact we can re-order the nodes in a set of trees such 
that all trees follow the same pattern. In this way trees that are different, but 
represent the same concept, will generally be normalized to the same 
structure. For example, if both Figure 3 and Figure 4 were changed such 
that all nodes were exclusively left-branching (i.e., in a form such that the
daughters of a node all appear to the left of the node), they would be 
identical. This creates a normalized dependency tree that we use to create 
normalized representations of terminology. 


Creating these normalized representations from key phrases is a 
three-step process: first, a dependency parser creates a dependency tree for 
each input phrase. Second, a filter removes all function words and other stop 
words such as prepositions from the dependency trees. Third, each 
dependency tree is made entirely left-branching. 
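

The three steps above can be sketched in Python under some simplifying assumptions. The dependency trees are written by hand rather than produced by a parser, the words are already lemmatized, the function word list is a toy one, and sibling order is made canonical by sorting. Sorting is a stand-in chosen for this sketch only; the order that the rules of the system actually impose may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Toy function word list: an assumption made for this illustration.
FUNCTION_WORDS = {"the", "a", "an", "of", "that", "are"}


@dataclass
class Node:
    word: str
    deps: List["Node"] = field(default_factory=list)


def drop_function_words(node: Node) -> List[Node]:
    """Remove function words; dependents of a removed node are promoted."""
    kept = [n for d in node.deps for n in drop_function_words(d)]
    if node.word in FUNCTION_WORDS:
        return kept
    return [Node(node.word, kept)]


def canonical(node: Node) -> Tuple:
    """Order-independent form: (word, sorted canonical forms of dependents)."""
    return (node.word, tuple(sorted(canonical(d) for d in node.deps)))


def normalize(root: Node) -> Tuple:
    (filtered,) = drop_function_words(root)  # assumes the root is a content word
    return canonical(filtered)


# Hand-built (hypothetical) parses of the two phrases in Figures 3 and 4.
fig3 = Node("leaf", [Node("the"),
                     Node("of", [Node("tree", [Node("the")])]),
                     Node("red", [Node("that"), Node("are")])])
fig4 = Node("leaf", [Node("the"),
                     Node("red"),
                     Node("of", [Node("tree", [Node("the")])])])
```

Under these assumptions, normalize(fig3) and normalize(fig4) return the same structure, with leaf as the head and red and tree as its dependents, even though the two input trees list their dependents in different orders.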


Though these normalized trees are good representations of phrasal
semantics, they are not easily understood by human users of a terminology. 
Understanding these trees requires an understanding of dependency 
grammar, a potentially non-intuitive concept. Instead, our strategy converts 
these normalized trees into human-readable forms using a number of 
concrete rules. These new representations are linear and can be stored as 
simple strings of characters. 


For example, the trees in Figure 3 and Figure 4 can be linearized by
first converting the trees into normalized form (Figure 5). Once we have
normalized the syntax, we can convert the structure into a linear format,
such as RED-TREE_LEAF. This format contains the same information as the
tree structure and is easily interpreted by English speakers and by
computers. For English speakers, the linear form corresponds to a standard
English phrase with the same meaning (the red tree leaf). For computers, the
underscore (‘_’) indicates that the two final roots (tree and leaf) compose
first, followed by red, which is equivalent to the structural information
shown in the tree. We have not yet discussed exactly how this linear format
is reached from the dependency tree; this, including the semantics of the
hyphen and underscore delimiters, will be explained in the following
section.


[Figure 5: Normalized tree for RED-TREE_LEAF. The head leaf has two dependents, in the order red, tree.]


3.2.2 Roots and Terms 


Just as the above structural representations depend on the principle 
of compositionality (Frege 1884) and the formulations of compositional 
semantics (Montague 1988), the linear term representation that we have 
developed facilitates breaking terms apart into smaller components. Unlike 
in natural language, we primarily use compounding, combining individual 
“words” into larger terms. The individual meaningful components of a term 
are called roots, and cannot usually be broken down into further meaningful 
parts. A root should correspond to a single meaningful word such as tree, 
electron, or computational. Roots can be combined in various ways to create 
structured terms, which are semantically complex structures whose 
meaning can be easily determined from their component parts. This is based 
on rules 1, 2, and 3, as well as the specialized terminology in Bhat et al.
(2015).


Roots can combine in different ways; each method of combination 
is represented textually by a unique delimiter, including the underscore 
(‘_’), the hyphen (‘-’), and the colon (‘:’). This delimiter notation is 
extensible and replaceable (completely different sets of delimiters may be 
used); new delimiters can be added to express new relationships between 
roots. The hyphen is a general delimiter used for combining roots into terms. 
Depending on the use-case of the terminology, the hyphen may have a 
slightly different meaning, but it should usually be used to add specificity 
to a root through the addition of a second root. For example, the term
TREE-LEAF is composed of the roots TREE and LEAF. The hyphen indicates that the
root LEAF (a very general term) is made more specific through the addition
of the root TREE, which indicates that the term as a whole represents a
specific type of LEAF: namely, the leaf of a tree, rather than the leaf of a
bush or other plant. Forms such as this can be easily derived from syntactic
structure: both the construction of terms from roots and the structure of
syntactic dependency indicate the modifier-head relationship between two
roots. In a dependency structure, the dependents of a head are its modifiers,
just as, in a term, a root (the head) is modified by the preceding root.


The interpretation of terms with three or more roots could be 
ambiguous. However, we impose a left-branching dependency syntax on all 
terms, meaning that the roots in a term compose from left to right. For
example, the term OAK-TREE-LEAF refers to the leaf of an OAK-TREE, and
OAK-TREE refers to a tree of type oak. The first semantic operation is the
composition of OAK and TREE. This is followed by the composition of
OAK-TREE (as a single concept) and LEAF.


Combining roots with more complex structures requires additional
delimiters. For example, as described above, the term RED-TREE-LEAF refers
to the leaf of a red tree, not to the red leaf of a tree; that is, in this term, the
tree is red but the leaf is not. The English phrase “red tree leaf” is ambiguous
in a way that the term RED-TREE_LEAF is not. This is why, if we are trying
to create a term with the interpretation the red leaf of a tree (the dependency
structure in Figure 5), it cannot be represented using hyphens alone. The
second delimiter, the underscore (‘_’), has higher precedence in the
compositional order of operations: composition of roots delimited by
underscores occurs before the composition of roots delimited by hyphens.
The hyphen thus has the lower precedence of the two delimiters. Two roots
joined by an underscore form a “super-root,” which allows for terms that
cannot be expressed solely by hyphens. For example, “the red leaf of a tree”
(Figure 5) can be represented as the term RED-TREE_LEAF. TREE_LEAF is a
super-root and thus combines first: it represents the leaf of a tree. The
super-root then combines with RED, specifying that the TREE_LEAF is of the
color red.
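

The order of operations described above (underscores first, then hyphens from left to right) can be made concrete with a short Python sketch that brackets a term into nested (modifier, head) pairs. The function names are hypothetical, and folding a super-root with more than two roots from left to right is an assumption of the sketch, since only two-root super-roots are discussed here.

```python
def _fold_left(parts):
    """Compose left to right: [a, b, c] -> ((a, b), c)."""
    result = parts[0]
    for part in parts[1:]:
        result = (result, part)
    return result


def decompose(term):
    """Bracket a term into nested (modifier, head) pairs.

    Roots joined by underscores are composed first (super-roots);
    the resulting units are then composed across hyphens, left to right.
    """
    units = [_fold_left(chunk.split("_")) for chunk in term.split("-")]
    return _fold_left(units)
```

Under this reading, decompose("RED-TREE-LEAF") yields (("RED", "TREE"), "LEAF"), the leaf of a red tree, while decompose("RED-TREE_LEAF") yields ("RED", ("TREE", "LEAF")), the red leaf of a tree.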


We define two additional methods for combining roots, though more 
methods can be added based on use-case (see Figure 5). The first is to 
combine the roots without any delimiter at all, referred to as creating a 
“tethered root.” For example, if TREE and LEAF were to be combined into a 
tethered root, they would become TREELEAF. The purpose of tethered roots 
is to create roots that are composed of multiple words in English, but which 
have meaning only as a whole phrase. If the components of a tethered root 
are represented as roots in a term, the meaning of the term will not follow 
from its component parts. Set phrases such as “gray area” (referring to a 
situation that does not easily fit into preexisting categories) are strong 
candidates for tethered roots. More generally, tethered roots are useful when 
the component parts are not usable the same way in other terms. For 
example, if GRAY-AREA is treated as a term, there should be other terms of 
the form GRAY-X where “gray” has the same meaning as in GRAY-AREA. 
However, because this is not possible, it is preferable to create a tethered 
root: GRAYAREA. 


The last delimiter we describe in this paper is the colon (‘:’). Two or 
more terms can be combined into compounded terms with this delimiter. 
A compounded term represents a high-level semantic cluster, though the 
usage of these terms is dependent on the use-case. While most terms 
represent concepts, compounded terms can also represent relationships 
between those concepts. For example, because APPLE-TREE and 
RASPBERRY-PLANT both refer to plants that bear fruit, the compounded term 
APPLE-TREE:RASPBERRY-PLANT could represent this fact about the two 
component terms. The exact type of relationship is not specified by the 
compounded term, which describes only a very general semantic connection 
between its components. 


All of these combinations can be generated automatically from a list 
of key phrases using dependency parsing and a training set. Roots are 
usually equivalent to nodes in a dependency structure. Because of this, they 
can be combined into super-roots and into terms by examining an 
automatically generated dependency tree, removing unimportant words, 
and performing automatic lemmatization. Creating tethered roots and 
compounded terms cannot be done with dependency trees alone, and 
requires the use of a training set. Tethered roots are formed when splitting
the term does not provide any reusable information. This can be measured
using statistics such as term frequency-inverse document frequency (a
measure of a word’s importance in a document relative to its overall
frequency) (Wu et al. 2008). Compounded terms can be identified using
measurements of co-occurrence frequency, which identify semantic 
relationships between terms (Kostoff 1993). Because terms are 
unambiguous, and different relationships between roots are represented by 
different delimiters, a machine can also easily break down a term into its 
component parts, just as it can build up a term based on the relationships 
between the components (Table 1). 


Table 1. Summary of term syntax


Root Type (delimiter)    Description                   Example
Term (-)                 Composite Concept             OAK-TREE
Root                     Single Concept                TREE
Super-Root (_)           High-precedence Composite     TREE_LEAF
Tethered Root            Multi-word Single Concept     GRAYAREA
Compounded Term (:)      Two related terms             APPLE-TREE:RASPBERRY-PLANT


4. Key Phrase Extraction for Root- and Rule-Based Terms 


In Sections 2 and 3 we have discussed how to generate structured
terms using a root- and rule-based approach taking advantage of syntactic, 
semantic, and statistical cues. Though structured terms are useful 
representations of concepts in a terminology, and though natural language 
processing and other statistical tools can convert key natural language 
phrases into structured terms, we have not yet discussed how these key 
phrases are selected. It is possible, of course, to manually select key phrases 
to be used in a terminology. In some cases, this is unavoidable, as there will 
always be disagreement as to what constitutes an important term within a 
domain, but it is helpful if at least some of the work can be done with an 
automated system. The automated system may be helpful in providing an 
empirical basis for coming to agreement. 


The study of automatic terminology extraction is a major area in the
field of information extraction (Witschel 2005). Most methods of 
terminology extraction rely on two components: a unithood metric and a 
termhood metric. A unithood metric determines the particular types of 
words and phrases that make potential candidates for terms. A unit is not 
necessarily the final representation of a term, nor are all units relevant 
enough to be treated as terms. For example, a unithood metric might 
consider all noun phrases (such as “the red leaf of a tree”) in a corpus to be 
valid term candidates. The task of a termhood metric is to determine which 
of the candidate terms are important enough within a document to be a part 
of the terminology. Together, a unithood metric and a termhood metric can 
extract all of the salient words and/or phrases from a document. 
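

A unithood metric of the part-of-speech pattern kind can be sketched in Python by encoding each tag as a single letter and running ordinary regular expressions over the resulting string. The sketch assumes the input has already been tagged (the tags below are written by hand), and the pattern names and one-letter codes are choices made for this illustration. The three expressions follow the Noun/Adj/Prep patterns that Frantzi, Ananiadou, and Tsujii (1998) consider.

```python
import re
from typing import List, Tuple

# One-letter codes for the tags used in the patterns; any other tag becomes "O".
TAG_CODES = {"Noun": "N", "Adj": "A", "Prep": "P"}

# Hypothetical names for the three part-of-speech patterns.
PATTERNS = {
    "noun_sequence": r"N+N",
    "adj_noun_sequence": r"[AN]+N",
    "with_preposition": r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N",
}


def candidates(tagged: List[Tuple[str, str]], pattern: str) -> List[str]:
    """Return the non-overlapping word sequences whose tags match the pattern."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    words = [word for word, _ in tagged]
    return [" ".join(words[m.start():m.end()])
            for m in re.finditer(pattern, codes)]


tagged_phrase = [("the", "Det"), ("red", "Adj"), ("tree", "Noun"), ("leaves", "Noun")]
```

For the hand-tagged phrase above, the noun_sequence pattern selects "tree leaves" and the adj_noun_sequence pattern selects "red tree leaves". Python's regular expression engine returns leftmost, non-overlapping matches and takes the first alternative that succeeds, so the with_preposition pattern may return shorter candidates than a longest-match implementation would. A termhood metric would then decide which of these candidates are retained.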


Most terminology extraction methods combine statistical and formal 
methods. Unithood metrics are often based partially on linguistic features 
such as part of speech. For example, it is uncommon to include isolated 
prepositional phrases in a unithood metric (though prepositional phrases 
that are included in other phrasal categories may be included). Termhood 
metrics usually analyze the frequency of a term in a document with respect 
to its frequency in a collection of documents in order to determine the extent 
to which the term represents the content of the document. However, 
termhood metrics can also take into account linguistic features; for example,
nouns with Greek or Latin endings such as -itis or -scopy may be more likely
to be technical terms in certain domains (Witschel 2005).
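

The frequency comparison described above can be illustrated with a small Python sketch of one common formulation, term frequency multiplied by the logarithm of the inverse document frequency. This is offered as a generic example of a frequency-based termhood score, not as the metric used by any of the systems cited here. Each document is assumed to be a list of candidate units that have already been extracted, and the sample data is invented.

```python
import math
from typing import List


def tf_idf(term: str, document: List[str], collection: List[List[str]]) -> float:
    """Score a candidate by its frequency in one document relative to the collection."""
    tf = document.count(term) / len(document)
    df = sum(1 for doc in collection if term in doc)  # documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(collection) / df)


# Invented sample collection of three documents.
docs = [
    ["photovoltaic cell", "result", "photovoltaic cell"],
    ["heat capacity", "result"],
    ["tree leaf", "result"],
]
```

In this invented collection, "result" occurs in every document and therefore scores zero for any of them, while "photovoltaic cell" occurs only in the first document and receives a positive score there.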


The proposed methods of syntactic analysis described in Section 2 
can take as input a series of phrases extracted from a document or corpus 
and convert them into structured terms. However, this algorithm is sensitive 
to the terminology extraction techniques used, as it is dependent on the 
interface between syntax and semantics. If the input phrases are too short to 
provide semantic clarity, the output terminology will be too general. If the 
input phrases are too long (with respect to the number of words), the output 
terminology will be too specific. Many terminology extraction algorithms 
only extract single morphemes or words — that is, they output terms such as 
“solar” or “photovoltaic” (Witschel 2005). However, our proposed system 
prefers units with two to five content words, as longer or shorter terms will 
tend to be either too general or too specific for most use-cases. Longer terms 
are more specific, and will often introduce nuances that are not necessary in 
domain terminologies. 


Bearing these restrictions in mind, there are still terminology 
extraction algorithms that cater to the needs of structured terminologies. 
Methods such as the NC-Value Algorithm (Frantzi, Ananiadou, and Tsujii
1998) are designed for extracting multi-word terms, and can easily be
extended to favor two- to five-word phrases in order to generate
the most effective structured terms.


Though our proposed system is sensitive to terminology extraction, 
the exact algorithm used is an implementation detail that can be changed 
easily, as discussed in Section 5. Different use cases may choose different 
terminology extraction algorithms, depending on their needs. The root- and 
rule-based approach that we describe is not specific to any terminology 
extraction algorithm, and the exact method can be customized according to 
the use case. 


The root- and rule-based approach we propose also provides a 
significant advantage for terminology extraction. Because root-based terms 
can easily be broken down into their component pieces, it is possible to 
compare two terms and find similarities between them. Because of this, it 
is possible to use previously generated normalized terms as hints for term 
selection. For example, given that the term RED-TREE_LEAF is salient in a
corpus, the terms RED-TREE_BRANCH, GREEN-TREE_LEAF, and
RED-BUSH_LEAF are probably salient as well, as they share much of the same
information.
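

One simple way to find such related terms is to split each term into its roots and look for terms that have the same delimiter structure and differ in exactly one root. The Python sketch below takes that approach; the function names and the "differs by one root" criterion are assumptions made for illustration, and other similarity measures over roots are equally possible.

```python
import re
from typing import List


def roots(term: str) -> List[str]:
    """Split a term into its roots at hyphens and underscores."""
    return re.split(r"[-_]", term)


def delimiters(term: str) -> str:
    """Keep only the delimiters, e.g. RED-TREE_LEAF -> '-_'."""
    return re.sub(r"[^-_]", "", term)


def differs_by_one_root(a: str, b: str) -> bool:
    """True if two terms share their delimiter structure and all but one root."""
    if delimiters(a) != delimiters(b):
        return False
    ra, rb = roots(a), roots(b)
    return len(ra) == len(rb) and sum(x != y for x, y in zip(ra, rb)) == 1
```

By this criterion, RED-TREE_BRANCH, GREEN-TREE_LEAF, and RED-BUSH_LEAF are each one root away from RED-TREE_LEAF, whereas RED-TREE-LEAF is not, because its delimiters encode a different structure.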


5. Extensibility 


The previous sections describe the various methods that go into 
terminology generation. The major components are salient phrase 
extraction (Section 4) and converting key phrases to structured terms 
(Section 2 and Section 3). However, using this model in a complete system 
is more complex. 


One of the primary benefits of this root- and rule-based approach is 
the compositional form of terms. Based on this approach, it is possible to 
build an extensible and modular system that can be adjusted to suit different 
needs. In this section we describe how the system as a whole can be 
configured through different modules and extensions, and how this is 
enabled by the rule-based model. 


Figure 1 shows the various processes involved in generating terms 
from a corpus of documents. These processes interact with three different 
types of data: the corpus, the set of key phrases, and the terminology (stored 
in the database). The corpus may be either a large set of documents used to 
initialize the terminology, or a smaller set of documents introduced through 
a user interface, as discussed in Section 6. The set of key phrases consists 
of the salient phrases extracted from the corpus; key phrases are natural 
language phrases that have not yet been processed by the structuring methods 
described in Sections 2 and 3. The terminology consists of any 
pre-generated terms, which can be used as a training set. During the term 
generation process, new terms are added to the existing terminology. 


Working with these data requires many different tools and 
sub-processes: key phrase extraction, tethered root generation, super root 
generation, lemmatization, and term generation. A key feature of the design 
is that these tools are not necessarily co-dependent, and can thus easily be 
substituted depending on the needs of users or on advancements in the 
technologies that constitute each component. 
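
The following sketch illustrates this kind of substitutability. Each 
stage is modeled as a plain function, so any stage can be swapped without 
touching the others. All names are hypothetical, the toy stages are 
stand-ins rather than real algorithms, and the tethered root and super 
root generators are omitted for brevity. 

```python
from typing import Callable, List

# Each stage is an ordinary callable, so it can be replaced independently.
KeyPhraseExtractor = Callable[[str], List[str]]
Lemmatizer = Callable[[str], str]
TermGenerator = Callable[[List[str]], str]


def build_pipeline(extract: KeyPhraseExtractor,
                   lemmatize: Lemmatizer,
                   generate: TermGenerator) -> Callable[[str], List[str]]:
    """Compose interchangeable stages into a document -> terms function."""
    def run(document: str) -> List[str]:
        terms = []
        for phrase in extract(document):
            lemmas = [lemmatize(word) for word in phrase.split()]
            terms.append(generate(lemmas))
        return terms
    return run


# Toy stand-ins, for illustration only.
def toy_extract(document: str) -> List[str]:
    """Treat ';'-separated chunks as already-extracted key phrases."""
    return [p.strip() for p in document.split(";") if p.strip()]


def toy_lemmatize(word: str) -> str:
    """Strip a trailing 's'; a real lemmatizer is far more careful."""
    return word[:-1] if word.endswith("s") else word


def toy_generate(lemmas: List[str]) -> str:
    """Join roots with hyphens only; super roots are ignored here."""
    return "-".join(w.upper() for w in lemmas)


pipeline = build_pipeline(toy_extract, toy_lemmatize, toy_generate)
print(pipeline("ultrasonic pulse techniques; copper single crystals"))
```

Replacing `toy_lemmatize` with a lexical-database lemmatizer, for 
instance, would require no change to the other stages. 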


5.1 Key Phrase Extractor 


The key phrase extractor has the task of extracting salient phrases 
from the corpus. As discussed in Section 4, there are many different 
algorithms that can handle this task. As such, it is possible to entirely replace 
the key phrase extractor that is used in a root- and rule-based terminology 
generation system. 


The key phrase extractor may also take advantage of any terms 
already in the term database by treating them as a training set. Different 
terminology extraction algorithms may use this training data in different 
ways. For example, some applications may only wish to include very close 
matches with preexisting terms, while others may choose to be more liberal 
with key phrase extraction. This allows different use-cases to use the 
preexisting terms as appropriate. 


5.2 Tethered Root Generator 


Tethered roots (see Section 3) may be generated based on two major 
components of the terminology generation system: the set of key phrases, 
and the set of preexisting terms. A tethered root generator may use statistical 
models to determine the information content of a given root relative to the 
set of all key phrases, or it may use other tethered roots in preexisting terms 
as cues to generate tethered roots from the current corpus. Again, this 
component can be customized according to the use-case. One possibility is 
using Shannon entropy (Shannon 1948) to identify sequences that add less 
information to a dataset when split up into multiple roots than they would if 
a tethered root were used instead. For example, if OAK-TREE-LEAF provides 
less information than OAKTREE-LEAF, then OAKTREE-LEAF could be used 
instead. Ideally, a tethered root generator will consider not only how much 
information is contained in each variant, but also whether the information 
is misleading or inconsistent. 
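
One possible reading of the entropy-based comparison is sketched below: 
the same key phrases are segmented twice, once with OAK and TREE as 
separate roots and once with the tethered root OAKTREE, and the Shannon 
entropy of the resulting root distribution is computed for each. The toy 
phrases and the `tether` helper are our assumptions, and the rule for 
deciding between the two segmentations is deliberately left open. 

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy, in bits, of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def tether(roots, pair):
    """Merge each adjacent occurrence of `pair` into one tethered root."""
    out, i = [], 0
    while i < len(roots):
        if tuple(roots[i:i + 2]) == pair:
            out.append(roots[i] + roots[i + 1])
            i += 2
        else:
            out.append(roots[i])
            i += 1
    return out


# Invented toy key phrases, already split into roots.
phrases = [
    ["OAK", "TREE", "LEAF"],
    ["OAK", "TREE", "BRANCH"],
    ["PINE", "TREE", "LEAF"],
]
split = [r for p in phrases for r in p]
tethered = [r for p in phrases for r in tether(p, ("OAK", "TREE"))]

h_split = shannon_entropy(split)
h_tethered = shannon_entropy(tethered)
print(h_split, h_tethered)
```

A tethered root generator could compare these two values, alongside 
other evidence such as tethered roots already present in the term 
database, before committing to OAKTREE-LEAF. 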


5.3 Super Root Generator 


Super root generation requires much of the same information as 
tethered root generation, except that roots are more likely to make sense 
when considered separately. Super root generation may also take advantage 
of a dependency parser in order to determine super roots based on syntactic 
structure. 


5.4 Lemmatizer 


To avoid creating unnecessary terms, roots are lemmatized so that 
differences in grammatical form are not codified. There are many 
different ways that words can be lemmatized, and lemmatization is a non- 
trivial task in computational linguistics (Sharma 2010). One common 
method is to use lexical databases such as WordNet (Fellbaum 1998), as in 
our forthcoming reference implementation of root- and rule-based 
terminology generation. 


5.5 Term Generator 


A term generator combines the roots that make up each key phrase 
into a structured term based on the results of a dependency parser and the 
methods described in Section 2 and Section 3. The rules in the rule-based 
system we describe are not static and can be changed by users, 
administrators, or developers when needed to improve the system’s 
performance at its given task. 


Altogether, these tools and sub-processes come together to form a 
model of terminology generation that is customizable, takes advantage of 
both linguistic and statistical facts, and is at the same time both machine- 
and user-accessible. 


5.6 Example 


An example of the entire terminology generation process is shown 
below, to illustrate how these components come together. This example 
begins with the following document, taken from Overton Jr and Gaffney 
(1955) (https://materialsdata.nist.gov/dspace/xmlui/handle/11256/79), 
from which terms will be extracted. 


The ultrasonic pulse technique has been used in conjunction with a 
specially devised cryogenic technique to measure the velocities of 
10-Mc/sec acoustic waves in copper single crystals in the range from 
4.2K to 300K. The values and the temperature variations of the 
elastic constants have been determined. The room temperature 
elastic constants were found to agree well with those of other 
experimental works. Fuchs’ theoretical c44 at 0K is 10 percent 
larger than our observed value but his theoretical c11, c12, K and 
(c11-c12) agree well with the observations. The isotropy, 
(c11-c12)/2c44, was observed to remain practically constant from 4.2K to 
180K, then to diminish gradually at higher temperatures. Some 
general features of the temperature variations of elastic constants are 
discussed. 


A key phrase extractor then determines the most salient phrases in the 
corpus and extracts them. Key phrases are shown in bold and underlined. 


The ultrasonic pulse technique has been used in conjunction with 
a specially devised cryogenic technique to measure the velocities of 
10-Mc/sec acoustic waves in copper single crystals in the range 
from 4.2K to 300K. The values and the temperature variations of 
the elastic constants have been determined. The room temperature 
elastic constants were found to agree well with those of other 
experimental works. Fuchs’ theoretical c44 at 0K is 10 percent 
larger than our observed value but his theoretical c11, c12, K and 
(c11-c12) agree well with the observations. The isotropy, 
(c11-c12)/2c44, was observed to remain practically constant from 4.2K to 
180K, then to diminish gradually at higher temperatures. Some 
general features of the temperature variations of elastic constants are 
discussed. 


Once these key phrases have been extracted, they need to be 
converted into normalized root- and rule-based terms. This involves 
parsing, lemmatizing, and structuring the phrases according to the methods 
discussed in Sections 2 and 3. 


For example, the phrase “temperature variations of the elastic 
constants” should be parsed and converted into the following dependency 
tree (Figure 6): 


Figure 6: Dependency representation of temperature variations of the 
elastic constants 


The words are then lemmatized and the syntactic structure 
normalized such that the tree is entirely left-branching. At this stage, 
unimportant function words such as “of” are also removed. This results in 
Figure 7. 


[Tree diagram with nodes: variations, temperature, constants, elastic, the] 


Figure 7: Normalized representation of temperature variations of the 
elastic constants 


Based on this structure, the system should generate the term 
ELASTIC-CONSTANT-TEMPERATURE_VARIATION. Because variation has two 
branches in Figure 6, one of them must be used to generate a super-root in 
order to keep the term unambiguous. This yields TEMPERATURE_VARIATION, 
which can then compose normally to create the complete term 
ELASTIC-CONSTANT-TEMPERATURE_VARIATION. Putting all of the key phrases 
through this same process yields ULTRASONIC-PULSE-TECHNIQUE, 
COPPER-SINGLE_CRYSTAL, ELASTIC-CONSTANT-TEMPERATURE_VARIATION, 
and ROOM-TEMPERATURE-ELASTIC_CONSTANT, which are inserted into the 
database as valid terms. Some discrepancies may result from this strategy; 
for example, the above terms include both the structures ELASTIC-CONSTANT 
and ELASTIC_CONSTANT. Such ambiguities are resolved through the use of 
a training set or manual curation. For example, if ELASTIC_CONSTANT is 
found in the training set, the system can resolve this conflict in favor of 
the super-root form. 
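
The conflict-resolution step can be sketched as follows, assuming that 
the training set is available as a set of known super-roots. The function 
name and this representation are our assumptions. Note also that the 
sketch rewrites every hyphenated match, whereas a full system would 
consult the parse before deciding that two structures really conflict. 

```python
def canonicalize(term, known_super_roots):
    """Rewrite hyphen-joined root sequences that match a known super-root."""
    for super_root in known_super_roots:
        hyphenated = super_root.replace("_", "-")
        # Pad with '-' so that only whole-root sequences are matched.
        padded = "-" + term + "-"
        padded = padded.replace("-" + hyphenated + "-", "-" + super_root + "-")
        term = padded[1:-1]
    return term


training_super_roots = {"ELASTIC_CONSTANT"}
generated = [
    "ELASTIC-CONSTANT-TEMPERATURE_VARIATION",
    "ROOM-TEMPERATURE-ELASTIC_CONSTANT",
]
for term in generated:
    print(term, "->", canonicalize(term, training_super_roots))
```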


5.7 Performance and Usage of Root- and Rule-Based Terms 


The previous sections demonstrate how root- and rule-based terms 
can be constructed using phrases extracted from natural language texts. 
However, we have yet to analyze the performance of root- and rule-based 
terms as data structures. One of the major advantages of these structures is 
that they are capable of being constructed automatically using linguistic and 
statistical methods, but the structure of the terms themselves provides 
additional performance and usability gains for many tasks. 


The example given in Section 5.6 shows how a small set of root- and 
rule-based terms can be generated from a single document. If this process 
is repeated over a larger sample of documents, the result is a large 
terminology representing concepts in a particular domain. In order to 
analyze this terminology, we examine the data that are represented by each 
term. 


To begin with, root- and rule-based terms are typically 
human-readable and understandable. The terms COPPER-SINGLE_CRYSTAL and 
ULTRASONIC-PULSE-TECHNIQUE are fairly simple to understand, assuming 
that the component parts are understood. Even without knowing what 
ultrasonic means, it is still possible to gather that 
ULTRASONIC-PULSE-TECHNIQUE refers to a technique involving ultrasonic 
pulses. To a human, root- and rule-based terms may not always seem 
unambiguous: ELASTIC_CONSTANT-TEMPERATURE_VARIATION may initially be 
perceived as referring to something of constant temperature rather than to 
the temperature variations of the elastic constant. However, the correct 
interpretation is typically apparent in English-language terms, as English 
tends to follow a head-final structure³ (which is imposed on all root- and 
rule-based terms). 


From a computational perspective, root- and rule-based terms are 
syntactically unambiguous. The term 
ELASTIC_CONSTANT-TEMPERATURE_VARIATION refers specifically to the 
temperature variation of the elastic constant; it cannot refer to elastic 
variations of constant temperature or to any other alternative referent. 
This has several implications. Firstly, two root- and rule-based terms that 
contain the same roots but have different structures can coexist. 
The term ROOM-TEMPERATURE-ELASTIC_CONSTANT is distinct from the 
term ROOM-TEMPERATURE-ELASTIC-CONSTANT: the former refers to the 
elastic constant at room temperature, while the latter refers to a constant 
relating to room temperature elastic (e.g. elastic material held at room 
temperature). Because these two terms have distinct meanings and distinct 
structures, both can be represented if necessary, and both can be generated 
from natural language. Secondly, the structure of root- and rule-based terms 
implies a larger semantic structure. 


The two terms ROOM-TEMPERATURE-ELASTIC_CONSTANT and 
ABSOLUTEZERO-ELASTIC_CONSTANT both contain the super-root 
ELASTIC_CONSTANT and both refer to types of elastic constant, i.e. the 
elastic constants at room temperature and the elastic constants at absolute 
zero. However, other terms, such as 
ELASTIC_CONSTANT-TEMPERATURE_VARIATION, also contain ELASTIC_CONSTANT 
but do not refer to types of elastic constant. Instead, they refer to a type 
of temperature variation. This can be determined from the structure of root- 
and rule-based terms, allowing a computer to generate term hierarchies and 
semantic maps. 
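
A minimal sketch of this structural test is shown below. It assumes that 
the head of a term is its final hyphen-delimited component, following the 
head-final ordering imposed on all terms. The helper names are 
hypothetical. 

```python
from collections import defaultdict


def head(term):
    """Final hyphen-delimited component: what the term is 'a type of'."""
    return term.split("-")[-1]


def group_by_head(terms):
    """Group terms under their heads to obtain a one-level hierarchy."""
    groups = defaultdict(list)
    for term in terms:
        groups[head(term)].append(term)
    return dict(groups)


terminology = [
    "ROOM-TEMPERATURE-ELASTIC_CONSTANT",
    "ABSOLUTEZERO-ELASTIC_CONSTANT",
    "ELASTIC_CONSTANT-TEMPERATURE_VARIATION",
]
for parent, children in group_by_head(terminology).items():
    print(parent, children)
```

Although all three terms contain ELASTIC_CONSTANT, only the first two 
are grouped under it; the third is grouped under TEMPERATURE_VARIATION. 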


The structural and hierarchical nature of root- and rule-based terms 
also allows for more powerful searches and analyses of data. Documents can 
be indexed not just by the words that occur in them, but by the salient 
concepts they discuss. This allows a document describing 
ROOM-TEMPERATURE-ELASTIC_CONSTANT to be distinguished from one describing 
ROOM-TEMPERATURE-ELASTIC-CONSTANT: even though the roots that make up 
these terms are the same, the two documents describe different 
concepts. Thus, a user searching a collection of documents can specify 
whether they are hoping to find documents on 
ROOM-TEMPERATURE-ELASTIC_CONSTANT or on ROOM-TEMPERATURE-ELASTIC-CONSTANT. 


3 In a head-final structure, the head phrase follows its dependents. For example, the noun 
in a noun phrase follows the adjectives which modify it. 


Furthermore, because ROOM-TEMPERATURE-ELASTIC_CONSTANT is more 
specific than just ELASTIC_CONSTANT, a user may be able to search for 
more general concepts as well. Similarly, given a general term such as 
ELASTIC_CONSTANT, a system can determine more specific related terms 
such as ROOM-TEMPERATURE-ELASTIC_CONSTANT and 
ABSOLUTEZERO-ELASTIC_CONSTANT. 
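
The narrowing step can be sketched as a suffix match over the 
terminology, assuming terms are stored as plain strings. The function 
name is hypothetical and the lookup is linear for simplicity. 

```python
def narrower_terms(general, terminology):
    """Terms ending in '-<general>', i.e. the general term plus modifiers."""
    suffix = "-" + general
    return sorted(t for t in terminology if t.endswith(suffix))


terminology = {
    "ELASTIC_CONSTANT",
    "ROOM-TEMPERATURE-ELASTIC_CONSTANT",
    "ABSOLUTEZERO-ELASTIC_CONSTANT",
    "ELASTIC_CONSTANT-TEMPERATURE_VARIATION",
}
print(narrower_terms("ELASTIC_CONSTANT", terminology))
```

ELASTIC_CONSTANT-TEMPERATURE_VARIATION is not returned, because 
ELASTIC_CONSTANT is a modifier in that term rather than its head. 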


6. Applying Root- and Rule-Based Terminologies 


In previous sections we have discussed how we propose to build 
domain terminologies using a root- and rule-based approach. We have 
described how an algorithm can convert phrases of natural language into 
structured terms, how key phrases can be extracted, and how this system 
can be extended and modularized. In this section, we discuss why the root- 
and rule-based approach we have proposed facilitates the creation of useful 
terminologies. 


It is non-trivial to measure the correctness of a domain terminology. 
Standard metrics such as precision (the percentage of the output answers 
that are desirable) and recall (the percentage of all desired answers that are 
actually contained in the output) may not accurately assess problems 
without definite answers. Terminologies are only desirable insofar as they 
represent sets of useful concepts relating to a domain; different uses may 
lead to different notions of desirability. A terminology does not represent 
every possible concept used in a domain — instead, it represents only those 
concepts that are of appropriate specificity for practical purposes, 
depending on the use case. 
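
For reference, the two metrics can be computed as follows. The term sets 
in the example are invented for illustration; in particular, the set of 
"desired" terms is exactly the quantity that, as argued above, has no 
single objective definition for a terminology. 

```python
def precision_recall(output, desired):
    """Return (precision, recall) for a set of output terms."""
    output, desired = set(output), set(desired)
    hits = output & desired
    precision = len(hits) / len(output) if output else 0.0
    recall = len(hits) / len(desired) if desired else 0.0
    return precision, recall


# Invented example data.
output_terms = {
    "ULTRASONIC-PULSE-TECHNIQUE",
    "COPPER-SINGLE_CRYSTAL",
    "OBSERVED-VALUE",
    "HIGHER-TEMPERATURE",
}
desired_terms = {
    "ULTRASONIC-PULSE-TECHNIQUE",
    "COPPER-SINGLE_CRYSTAL",
    "ROOM-TEMPERATURE-ELASTIC_CONSTANT",
}
precision, recall = precision_recall(output_terms, desired_terms)
print(precision, recall)
```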


This notion of practical validity does not have an objective definition 
that can be directly measured. Instead, the use determines the validity of 
terminology in a particular context. For example, a terminology that is based 
entirely on taxonomy (a scheme of hierarchical categorization) may be 
useful for some tasks, but for others, it may be desirable to represent 
information about the properties of terms using a different scheme of 
classification. In some cases, for example, it may be useful to know that RED 
is a type of COLOR, while in others it will instead be useful to know that a 
BALLOON has the property COLOR with value RED. The root- and rule-based 
approach can be adapted to support new ways of classification. 


More generally, terminologies need to support many types of user 
interaction on demand. For example, in addition to serving end-users, 
a terminology needs to be easily maintained by administrators. The 
implementation of a terminology may be very efficient on the user end, 
returning the results of search queries very quickly, but inefficient on the 
administration end. This is often a detail of implementation, but it can be 
relevant to the formulation of the terminology model. 


The root- and rule-based model that we have described is designed 
to meet the needs of a variety of general use-cases and to be configurable 
enough to meet the needs of more specific situations. We have considered 
four general use-cases in our proposal: document entry, document retrieval, 
curation, and rule changes. These use cases assume a centralized database 
containing the core terminology. The relationships between these use cases 
and the data are shown in Figure 8. The terminology generator shown in 
Figure 8 is the same system shown in Figure 1. The desirable components 
of a system such as this are described in Section 1.2. 


[Diagram. Legible labels: Document Retrieval Interface, Document Entry 
Interface, Curation Interface, User, Curators, Term Database, Term 
Generator, Initial Corpus] 


Figure 8. A strategy for use-cases to create and manage root- and rule- 
based terminologies 


6.1 Document Entry Interface 


We have proposed that a document entry interface should provide 
functionality that allows a user to upload a document to the system. The 
system should then determine the terms in the document that match pre- 
existing terms, extract any new terms from the document, and allow the user 
to edit these terms. 


We propose that a user should be able to edit the results of document 
entry by removing those terms that do not apply to the uploaded document, 
or by adding new terms that the algorithm did not find. 


A document entry interface is dependent on the ability to quickly 
identify terms in a single document. That is, an automated system that 
allows for single document entry must be robust to corpora containing few 
documents. The terminology must be able to be built one document at a 
time, with no dependency on large corpora. Our proposed root- and 
rule-based system would take advantage of redundant methods of terminology 
extraction in order to work with different-sized corpora. In addition to 
extracting terms with a more general-purpose terminology extraction 
algorithm (see Section 4), our proposal also takes advantage of the 
structured nature of terms to find terms in new documents that are 
structurally similar to previous terms. This often allows the root- and rule- 
based method to identify useful new terms even without statistical evidence. 


6.2 Document Retrieval Interface 


In our proposed document search and retrieval interface, a user 
should be able to enter one or more terms and receive a listing of documents 
containing the given terms. In the simplest case the user simply inputs a list 
of terms, and the system locates all of the documents in the database 
containing the given terms. More complex search systems are also possible, 
allowing for additional refinement of search criteria. 


Structured terms improve the potential of search systems. The 
compositional nature of terms in this model means that users can make 
semantic searches, rather than simply searching for the presence of a 
collection of unrelated words. That is, instead of searching for a document 
that contains the words “red”, “tree”, and “leaf”, a user can easily identify 
the exact concept in question and search for RED-TREE_LEAF. This allows the 
user to make much more succinct and semantically rich search queries, 
producing narrower and more relevant result sets. 
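
The simplest case described above can be sketched with an inverted index 
from terms to documents. The document identifiers and the in-memory 
representation are assumptions made for illustration; the proposal 
itself assumes a centralized term database. 

```python
from collections import defaultdict


def build_index(doc_terms):
    """Map each structured term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def search(index, query_terms):
    """Return the documents that contain every term in the query."""
    sets = [index.get(term, set()) for term in query_terms]
    return set.intersection(*sets) if sets else set()


# Both documents contain the words red, tree, and leaf, but only doc1
# discusses the concept RED-TREE_LEAF.
documents = {
    "doc1": {"RED-TREE_LEAF"},
    "doc2": {"GREEN-TREE_LEAF", "RED-BUSH_LEAF"},
}
index = build_index(documents)
print(search(index, ["RED-TREE_LEAF"]))
```

A word-based search for “red”, “tree”, and “leaf” would match both 
documents, whereas the term-based query matches only the first. 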


The system by which terms are generated from natural language 
phrases can also be used to improve the usability of user search interfaces. 
Rather than requiring that users search directly for structured terms in the 
database, which requires an understanding of the way that root- and rule- 
based terms are formed, the interface can instead allow users to input 
phrases of natural language. For example, if the user inputs “the red leaves 
of a tree”, the interface can quickly generate suggested terms, such as 
RED-TREE_LEAF, meaning that users only need a passive knowledge of term 
structure in order to make advanced use of the interface, while still allowing 
for unambiguous searches. 


6.3 Curation Interface 


A curation interface allows a team of curators to make changes to 
the terminology. Curators may either be a select team of administrators, or 
volunteer end-users. That is, it is possible in some cases to crowd-source 
curation. Depending on the resources available for a particular system, it 
may be desirable to have dedicated moderators or to allow end-users to 
make their own changes to the database. We make no suggestions as to 
which is more generally preferable, but instead describe a terminology 
system that is able to handle both. 


Curation may be partially dependent on search, as discussed in 
Section 6.2, as curators need to be able to locate terms to change. 
However, curators may benefit from more than a document retrieval system, 
as they should be able to examine the complete structure of the terminology. 
For example, curators may wish to view taxonomic relationships between 
terms in order to ensure that the taxonomy is structured correctly. Curators 
should be able to make changes to the structure, as well as to individual 
terms in the terminology. 


A root- and rule-based terminology interacts well with this sort of 
curation interface. The semantic nature of terms can be used to determine 
the taxonomic structure of the terminology implicitly: as discussed in 
Section 2, modifiers (roots other than the head) add specificity, so 
unmodified terms are hypernyms of their modified variants. Additional 
relationships between terms are represented through 
compounded terms (semantic clusters represented by terms that have been 
combined using the delimiter ‘:’), allowing for graph-based visualizations 
such as that shown in Figure 9. A visualization such as this might be derived 
from the following two terms: 
ELECTRON-CYCLOTRON-CURRENT_DRIVE:NONINDUCTIVE-CURRENT_DRIVE and 
ION-CYCLOTRON-CURRENT_DRIVE. By parsing these terms into roots, an 
algorithm or a user can easily determine that both terms represent types of 
CURRENT_DRIVE, and that ELECTRON-CYCLOTRON-CURRENT_DRIVE and 
ION-CYCLOTRON-CURRENT_DRIVE are types of CYCLOTRON-CURRENT_DRIVE. 
Furthermore, the user interface can show that there is some relationship 
between NONINDUCTIVE-CURRENT_DRIVE and 
ELECTRON-CYCLOTRON-CURRENT_DRIVE. This information is all determined from 
the structure of these terms and the composition of their roots. 


CURRENT_DRIVE 


CYCLOTRON-CURRENT_DRIVE    NONINDUCTIVE-CURRENT_DRIVE 


ELECTRON-CYCLOTRON-CURRENT_DRIVE    ION-CYCLOTRON-CURRENT_DRIVE 


Figure 9. Visualizing term relationships 
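
The derivation of such a graph from term structure can be sketched as 
follows. The sketch assumes that dropping the leading modifier of a term 
yields its next more general term, and that the members of a compounded 
term (joined by ‘:’) are related to one another by an unlabeled edge. 
The function name and the edge representation are hypothetical. 

```python
def term_edges(compounded):
    """Return (taxonomic, related) edge sets derived from structure alone."""
    members = compounded.split(":")
    taxonomic = set()
    for term in members:
        parts = term.split("-")
        # Dropping the leading modifier yields the next more general term.
        for i in range(len(parts) - 1):
            child = "-".join(parts[i:])
            parent = "-".join(parts[i + 1:])
            taxonomic.add((child, parent))
    # Members of one compounded term are related, but the relation is unlabeled.
    related = {(a, b) for i, a in enumerate(members) for b in members[i + 1:]}
    return taxonomic, related


taxonomic, related = set(), set()
for term in [
    "ELECTRON-CYCLOTRON-CURRENT_DRIVE:NONINDUCTIVE-CURRENT_DRIVE",
    "ION-CYCLOTRON-CURRENT_DRIVE",
]:
    t, r = term_edges(term)
    taxonomic |= t
    related |= r

for child, parent in sorted(taxonomic):
    print(child, "is a type of", parent)
for a, b in sorted(related):
    print(a, "is related to", b)
```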


The implicit and explicit relationships between terms in this model 
allow for changes to terms without requiring manual restructuring of the 
hierarchical structure. For example, if the system erroneously created the 
term RED-LEAF_TREE instead of RED-TREE_LEAF, a curator could correct it 
without needing to manually relocate the term to its proper 
taxonomic position beneath LEAF. Instead, the curator only needs to change 
the term RED-LEAF_TREE to RED-TREE_LEAF, and the fact that 
RED-TREE_LEAF is a type of LEAF can be automatically inferred. 


6.4 Rule Change Interface 


Rule change interfaces are desirable in many evolving situations. 
Due to the constant changes in knowledge and vocabulary that occur in all 
domains, it is sometimes necessary to update the way that the rules are used 
to generate terms. This interface may be particularly useful during the 
infancy of this technology. In our proposal, it may be desirable to create 
new delimiters with new meanings, change the behavior of delimiters, or 
change the way that terms are selected. In addition, it may become feasible 
to make changes due to improvements in technologies. In our proposal, this 
process does not require a complete restructuring of the system. Instead, 
only an interface which allows for the replacement of individual 
components of the system, as described in Section 5, is necessary. 


6.5 Generating Ontologies 


As discussed in Section 1, one of the motivations for a root- and 
rule-based method of terminology generation is practical ontology 
extraction for the semantic web as well as identifying, tracking, and 
predicting changing trends. The root- and rule-based terminologies we have 
discussed are well-adapted for use in ontologies. Though the methods we 
describe do not immediately output a complete ontology, the structured 
nature of terms in root- and rule-based terminologies means that a 
terminology can easily be extended into the skeleton of an ontology. As 
shown in Figure 9, some of the relationships between terms can be inferred 
from term structure. These relationships are primarily taxonomic, but 
compounded terms reveal additional connections that may inform the 
structure of a complete ontology. 


Some of the relationships that structured terms encode are 
underspecified. For example, the term 
ELECTRON-CYCLOTRON-CURRENT_DRIVE:NONINDUCTIVE-CURRENT_DRIVE entails that 
there is a relationship between ELECTRON-CYCLOTRON-CURRENT_DRIVE and 
NONINDUCTIVE-CURRENT_DRIVE. However, it does not specify what that 
relationship is. An ontology may need to supply this additional information, 
but the presence of the relationship is already known. Because of this, a 
root- and rule-based terminology provides many of the relationships 
necessary for a domain ontology. A system could generate a simple 
ontology by adding labels to these relationships and adding any relevant 
edges that may have been left out in terminology generation. 


Ontologies are usually dependent on the terminologies that inform 
them. By basing ontologies on root- and rule-based terminologies, it is 
possible to create ontologies that are human- and machine-readable. 
Because language, including the vocabulary used in most domains, is 
constantly evolving, ontologies also need to evolve. Because root- and rule- 
based terminologies are adaptable to new needs and to evolving vocabulary 
as discussed in Section 5, they are superior to ad hoc terminologies for 
constructing practical domain ontologies. 


It is possible to partially automate the creation of domain ontologies 
from terminologies. Statistical analyses such as co-word analysis (Kostoff 
1993; Coulter et al. 1996; Coulter, Monarch, and Konda 1998) can suggest 
relationships among the terms in a terminology, though they do not label 
these relationships. These methods posit relationships between terms that 
occur together within a fixed window. For example, if the terms 
ELECTRON-CYCLOTRON-CURRENT_DRIVE and NONINDUCTIVE-CURRENT_DRIVE occur 
together more often than is expected based on the frequencies of the 
individual terms, then it is likely that there is a semantic connection between 
them. Although these analyses do not label the relationships automatically, 
crowd-sourced methods can help to assign labels to common relationships 
and create training sets for future automatic labeling techniques. Though 
there is currently no fully implemented application that extends root- and 
rule-based terminologies into domain ontologies, a system that uses a 
root- and rule-based approach as well as co-word analysis and crowd-sourced 
labeling is currently in development. Moreover, the terminologies generated 
by this approach can be used with current applications, both commercial 
and freely available, to produce terminological networks much like the ones 
produced by co-word analysis using tools such as Leximancer and Gephi 
(Smith and Humphreys 2006; Mathieu Bastian 2009). 
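
The co-occurrence test described above can be sketched as a ratio of 
observed to expected co-occurrence counts. This is a simplified 
illustration, not the specific statistic of the cited co-word analyses: 
it assumes that the window is a whole document and that expected counts 
follow from treating term occurrences as independent. The example 
documents are invented. 

```python
from collections import Counter
from itertools import combinations


def cooccurrence_ratio(doc_terms):
    """Observed / expected co-occurrence count for each co-occurring pair."""
    n = len(doc_terms)
    term_count = Counter(t for terms in doc_terms for t in set(terms))
    pair_count = Counter(
        pair
        for terms in doc_terms
        for pair in combinations(sorted(set(terms)), 2)
    )
    # Under independence, a pair is expected in n * p(a) * p(b) documents.
    return {
        pair: observed / (term_count[pair[0]] * term_count[pair[1]] / n)
        for pair, observed in pair_count.items()
    }


E = "ELECTRON-CYCLOTRON-CURRENT_DRIVE"
I = "ION-CYCLOTRON-CURRENT_DRIVE"
N = "NONINDUCTIVE-CURRENT_DRIVE"
docs = [[E, N], [E, N], [I], [I, N]]  # invented term lists, one per document

ratios = cooccurrence_ratio(docs)
for pair, ratio in sorted(ratios.items()):
    print(pair, ratio)
```

A ratio above 1 suggests, but does not label, a relationship between the 
two terms; labeling would still fall to curators or crowd-sourced 
methods. 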


7. Conclusion 


Our root- and rule-based approach presents several advantages for the 
development of domain-based terminologies that are not available in semi- 
structured models, while still maintaining both human- and machine- 
readability. The primary advantage of root- and rule-based terms is that they 
allow one to consistently and clearly represent important domain concepts. 
Root- and rule-based terms are compositional, allowing for the division of 
terms into their component parts for searching, selecting new terms, or 
deriving relationships between terms. 


Our proposed root- and rule-based model is also highly modular, 
meaning that different users can easily adapt and maintain terminological 
systems. Components may be updated with technological advances, the 
introduction of new uses, or adaptations based on how systems are being 
used in real-world situations. The model is also designed with many 
important use-cases in mind, including search, document uploading, and 
curation. This allows the system to be practical for the needs of different 
users, for the administration, and for developers and scientists hoping to 
expand upon a root- and rule-based system. 


The model presented here is linguistically motivated, and follows 
from many aspects of linguistic theory, including syntax, semantics, and 
pragmatics, allowing it to connect on a fundamental level with the way that 
humans actually use language, rather than with mathematical constructs 
transparent only to machines. Like language itself, this model is 
evolutionary, use-based, and compositional, designed with practical needs 
rather than purely theoretical constructs in mind. 


Appendix A: Rules 


The enumerated rules listed below were originally published in Bhat et al. 
(2015). 


If 


Forming roots: 


I. 


oF 


Use all roots in singular form except where plural form is used 
more frequently. 


Avoid using special characters (such as’: _ - =/\) asa part of 
a root. 


Avoid the use of modifiers as roots. 


Use abbreviations only when they are widely accepted across 
many related disciplines and when they are unambiguous in 
their meaning. See Rule [itm:ambiguities] for exceptions when 
acronyms are embedded in a super root. Use uppercase for all 
acronyms except for atomic symbols. 


For similar expressions choose a shorter equivalent as a root. 


2. Forming super roots: 


A super root is formed when the roots involved do not have a 
preferred discriminating power and semantics to serve as node names 
of a data-graph or as RDF elements except in special circumstances. 


a) Super roots are concatenated by an underscore to indicate their 
compound semantics and their ability to be parsed into individual 
roots only under unusual conditions. If a super root is composed 
of roots that are not specific when considered individually, then 
refer to tethered roots (see Rule 3). 


b) When a root of a super root functions like a hierarchical 
classifier to another root, then also include the classified root in 
the super root so that automated parsers can recognize the 
hierarchy. To order roots within a super root, unless there 
already exists a well-accepted alternate convention, use Rule 6. 


3. Forming tethered roots: 


a) Create tethered roots when a root is a qualifier of another root and 
the semantics of any root on its own may not be of interest in a 
database or data repository search. Tethered roots are formed to 
indicate that the roots involved need to be considered collectively, 
rather than individually, in order to derive their semantics. For this 
reason, roots in a tethered root are written contiguously to avoid 
inadvertent separation by automated methods. Since tethered 
roots are composed of qualifier and qualified roots, following a 
general convention of root-based construction of English 
language words, we use their intrinsic qualifier-qualified 
relationship to order their roots. 


b) A root may appear in more than one tethered root. 


Tethered roots may also provide a way to avoid the use of stop words 
in a compounded root. That is, move the word from the right of a stop 
word to the left, drop the stop word, and place the qualifier before the 
qualified. 


4. Forming terms from roots: Terms are formed by concatenating two or 
more roots, super roots, or tethered roots using a hyphen (-) so that 
automated methods may regenerate their roots when necessary. We 
suggest ordering the roots of a term by classifier-classified relationships 
(see Rule 6), which is also a general convention in English, as in 
police dog or technical paper, unless there is a different well-accepted 
convention. 


5. Avoiding ambiguities and redundancies: 


a) Avoid using ambiguous acronyms. Instead clarify their meaning 
by qualifying them with a classifier ‘root’ to form a super root or 
a tethered root or use the complete phrase. 


b) Avoid the inclusion of redundant words in a term. 


6. Ordering roots in a term (classifier-classified rule): Roots (super root, 
tethered root and root) within a term are organized by a left to right, 
semantic top-down, classifier-classified hierarchy. In general, 
classifier and classified roots are expected to have one-to-many 
relationships where, in a rules-based approach, for example, the root 
alloy is a classifier for many materials. Rule 16 deals with instances 
where a relationship is not obvious or when a relationship changes 
over time due to the addition of new terms. In short, a hierarchy is not 
absolute but rather, it varies with the number of relevant use-cases. 


a) One way to identify classifier and classified roots in a term is to 
arrange the terms with an embedded hierarchical top-down, level-based 
classifier (for each ‘classifier’ term there exist several 
possibilities of ‘classified’ terms) statement with a hyphen 
between classifier and classified terms (e.g., 
MODELINGSOFTWARE-VASP, MODELINGSOFTWARE-ABINIT). On 
sorting these terms, classified roots appear as the fast-varying 
strings (VASP, ABINIT) and their classifier roots appear as the 
slow-varying term (MODELINGSOFTWARE). Automated methods 
may use this feature to develop hierarchical data models that can 
be presented as data-graphs or RDF or used for auto-complete to 
select terms for reliable search results. 


b) When a classifier-classified relationship does not exist among the 
roots, place them in an alphabetical order. 


7. Creating roots and terms with similar, multiple, or complex meanings: 
Following Rule 1, use a shorter root for words with similar meaning 
whenever possible. A root embedded in a term can help automated 
methods, such as co-word analysis, natural language processing, and 
text-mining, to identify related semantic classes. To facilitate this 
process, it is recommended: a) to limit the use of synonymous roots; 
b) if necessary, clarify the semantics of a root by appending it with a 
classifier-root. 


8. Reusing terms to create compounded terms: Create terms by 
combining roots so that terms have clear semantics. Avoid terms that 
are broad and general in meaning. Create terms that can serve as 
‘semantic expressions’ in use-cases. A rule of thumb is to attempt to 
form terms with three roots and, if needed, combine between two and 
five terms to form suitable semantic expressions. 


9. Creating compounded terms that identify a group of objects in the 
terminology: Compound terms serve as ‘use-cases’ defining semantic 
expressions of terms and they are formed by concatenating two or 
more terms using a colon (:) as a special delimiting character. 
Compounded terms that are overly specific are unlikely to be reused. 
It is advised to limit the number of terms in a compounded term to 
between two and five terms. Compounded terms may point to 
persistent identifiers (PIDs), such as DOIs (Digital Object Identifier) 
for query purposes. Compounded terms may be used by database 
providers or repository administrators to cluster, identify, and display 
related items using messages like ‘related to items that you have 
viewed’. 

a) Use classifier-classified hierarchical Rule 6 to decide the order of 
terms in a compounded term. 


b) When creating compounded terms, give importance to ‘use case-on-demand’ 
hierarchies, which are case-based rather than fixed 
schema-based hierarchies. Order a term so that a term to the left 
has a one-to-many relationship with the term to its right. 


10. Provide a reference to any paper that supports the use of the new 
term(s) you are creating. The reference may serve as a ‘definition’ of 
the term and may also demonstrate use of the term within a context. 


11. Design for readability of compounded terms: Use uppercase for the 
first letter of a term and use lowercase for all the rest unless a root is 
a short form or a symbol. 


12. Provide usage statistics for terms: For each term in a database or 
repository, store its usage statistics for users to inspect, along with the 
terms. These frequencies may allow a user to avoid terms that are used 
infrequently. 


13. Provide semantic context of terms and compounded terms: In the 
database, also keep and display a bibliographic reference and/or DOI 
to illustrate the use and semantics of the term. This reference may also 
be used as the basis to build use-case-specific compounded terms or 
segments of data-graphs. 


14. Identify new terms introduced by users, and flag terms for which no 
documentation is provided (see Rule 10). 


15. Allow the creation of dialects: Terms that do not follow the rules may 
also be created as local dialects when necessary. Dialects may 
facilitate a gradual evolution of rule-based terminology and the rules 
in a crowd-sourced environment. 


16. Curate and validate terminology and compounded terms on a regular 
basis: Dialects are important components of the proposed method for 
terminology building. Therefore, accepting or removing dialects as 
terminology must be facilitated by public resource providers who act 
as caretakers. Redefining super roots, tethered roots and classifier- 
classified relationships among roots are all important steps of the 
evolution process of the proposed term building effort. Database 
developers and repository administrators need to have an established 
mechanism for regular updates to support a smooth evolution process. 
Frequency of usage and the semantic context of terms are useful 
factors to monitor in such an evolution process. 


17. Apply new technologies that have been adopted widely: Explore 
whether new data technologies may require the rules to be updated. 
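

The delimiter conventions in the rules on super roots, terms, root ordering, and compounded terms lend themselves to a short sketch. The Python below is illustrative only and is not the implementation of Bhat et al.; the function names (split_compounded_term, split_term, split_super_root, classifier_index) are hypothetical, and the code assumes well-formed input in which terms are joined by colons, roots by hyphens, and the roots of a super root by underscores. 

```python
from collections import defaultdict

TERM_DELIMITER = ":"        # joins terms into a compounded term
ROOT_DELIMITER = "-"        # joins roots, super roots, and tethered roots into a term
SUPER_ROOT_DELIMITER = "_"  # joins the roots inside a super root


def split_compounded_term(compounded):
    """Split a compounded term into terms, enforcing the two-to-five guideline."""
    terms = compounded.split(TERM_DELIMITER)
    if not 2 <= len(terms) <= 5:
        raise ValueError("a compounded term should hold between two and five terms")
    return terms


def split_term(term):
    """Recover the roots, super roots, and tethered roots of a term."""
    return term.split(ROOT_DELIMITER)


def split_super_root(super_root):
    """Parse a super root into its individual roots (needed only rarely)."""
    return super_root.split(SUPER_ROOT_DELIMITER)


def classifier_index(terms):
    """Group classified roots under their classifier root.

    After sorting, the left-most root is the slow-varying string (the
    classifier) and the remainder is the fast-varying string (the classified).
    """
    index = defaultdict(list)
    for term in sorted(terms):
        classifier, _, classified = term.partition(ROOT_DELIMITER)
        if classified:  # single-root terms carry no hierarchy
            index[classifier].append(classified)
    return dict(index)


terms = ["MODELINGSOFTWARE-VASP", "MODELINGSOFTWARE-ABINIT"]
print(classifier_index(terms))
```

An index of this kind is one possible basis for the auto-complete and data-graph uses mentioned in the rule on ordering roots; a production system would also need to handle dialect terms that do not follow the delimiters. 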
Appendix B: Noun Phrase Syntax and Semantics 


One of the key components of any automated terminology 
generation system is to find and represent important concepts. Typically, 
the source of information that leads to these representations is a series of 
natural language texts, such as a corpus of scientific articles. For our root- 
and rule-based approach, representing these natural language descriptions 
is dependent on an effective model of semantics, which we discuss in this 
chapter. A complete model of natural language semantics (i.e. a perfect 
representation of meaning) is most likely beyond the scope of contemporary 
linguistics and computer science, so we will focus on a limited subset of 
semantics — namely, the compositional semantics of noun phrases. For the 
purposes of this paper, we consider a noun phrase to consist of a noun in 
addition to all of its modifiers and any determiners (such as an, the, this, 
and numbers). For example, the green vase is a noun phrase consisting of 
an article (the), a modifier (the adjective green), and a noun (vase). Some 
of the theory discussed in this section will apply to verb phrases and other 
syntactic categories, but noun phrases are ideal in that they are relatively 
concise and in that they are often clear representations of key concepts and 
commonly appear as headwords or phrases in technical glossaries. 


A very simple model of noun phrase semantics would be to simply 
represent words as themselves or their lemma-forms. For example, the 
words leaf and leaves could both be represented as LEAF. For more complex 
noun phrases, such as green leaf, this method becomes more problematic. 
We could create a representation such as GREENLEAF, a term used 
specifically for green leaves, but this model does not unambiguously reveal 
that the meaning of GREENLEAF is related to the meanings of GREEN and 
LEAF. For this reason, we also model the syntax of noun phrases. Syntax can 
be represented in several ways, but two of the most common methods are 
phrase structure grammar and dependency grammar. These models show 
the relationships between words in a phrase, revealing for example that the 
word green in the phrase green leaf is an adjective modifying the noun leaf. 
Syntax and compositional semantics are related concepts; the interpretation 
of a phrase is derived partially from its syntactic structure (i.e. the 
organization of the words in a phrase or sentence) (Montague 1988), and so 
using syntax as a proxy for semantics is reasonable. 


Though syntax may relate to semantics, it is not a perfect stand-in. 
There is not a one-to-one relationship between syntax and semantics — 
multiple syntactic forms can have the same semantic meaning, and the same 
syntactic form can have multiple meanings. Because of this, deriving 
meaning purely from the way that words are put together can lead to 
ambiguities. Consider the noun phrases in examples 1 through 4; though 
these sentences have different syntactic structures and are composed of 
different words, they have very similar meanings. 


1) the tree’s red leaves 

2) the tree’s leaves that are red 

3) the leaves of the tree that are red 
4) the red leaves of the tree 


If we represent all of these phrases as the unordered set of 
“important” words in each sentence, all of these sentences could be 
represented as {tree, red, leaves}. However, unwanted phrases will also be 
represented this way; for example, the red tree’s leaves will also be 
represented as {tree, red, leaves}. Clearly some syntactic information is 
necessary in order to create a sufficiently discriminating system. Yet a simple 
syntactic model is too discriminating; the words in these four sentences are 
ordered differently, are contained within different syntactic constituents, 
and are different in grammatical form (tree occurs both in its base form and 
in its possessive form tree's). Thus, we need to develop a model of syntax 
and semantics that normalizes these differences while still separating them 
from other phrases. We will begin by discussing how common models of 
syntax, such as dependency grammar, can be used as a basis for generating 
structured terms. 
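

The over-generalization of the unordered-set representation can be checked in a few lines of Python. This is a minimal sketch rather than part of the proposed system: the stop list FUNCTION_WORDS and the helper content_words are assumptions introduced for illustration, and words are left uninflected, as in the set {tree, red, leaves} above. 

```python
FUNCTION_WORDS = {"the", "of", "that", "are"}  # assumed stop list for this example


def content_words(phrase):
    """Return the unordered set of "important" words in a phrase.

    The possessive marker is stripped so that tree's and tree coincide.
    """
    words = set()
    for token in phrase.lower().split():
        for suffix in ("’s", "'s"):
            if token.endswith(suffix):
                token = token[: -len(suffix)]
        if token not in FUNCTION_WORDS:
            words.add(token)
    return frozenset(words)


examples = [
    "the tree's red leaves",
    "the tree's leaves that are red",
    "the leaves of the tree that are red",
    "the red leaves of the tree",
]
unwanted = "the red tree's leaves"

representations = {content_words(phrase) for phrase in examples}
print(len(representations))                        # 1: all four examples coincide
print(content_words(unwanted) in representations)  # True: so does the unwanted phrase
```

The second result is the problem described above: without any syntactic information, a phrase about a red tree cannot be told apart from phrases about red leaves. 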


B.1 Dependency Grammar 


Modern Dependency Grammar (DG) was first described in Tesnière 
(1959), and has since become one of the primary syntactic models used in 
computational linguistics. The key concept in DG is, of course, 
dependency, which is a one-to-one correspondence between morphemes in 
which every morpheme is headed by some other morpheme. For the 
purposes of this paper, we define a morpheme as the smallest unit of 
language that has meaning, including simple unbound words such as leaf as 
well as affixes such as the -s in trees. The head of a phrase carries that 
phrase’s syntactic category (i.e. the phrase the green vase behaves as a noun 
and is considered a noun phrase because it is headed by vase, which is a 
noun on its own). In DG, the head of a sentence is the verb, and all other 
morphemes are either direct or indirect dependencies of the main verb. The 
dependencies of a morpheme are those headed by it. For example, in the 
phrase the green vase, vase is the head of green and the, so the dependencies 
of vase are green and the. If green had its own dependencies, these would 
be indirect dependencies of vase. 


DG can be represented graphically as a tree structure with the verb 
as the parent node and dependencies represented as daughter nodes. 
However, since we are dealing exclusively with noun phrases, we will use 
a noun as the parent and adjectives, determiners, relative clauses, and other 
nouns as dependencies.¹ 


The four noun phrases from examples (1) through (4) are 
represented as dependency trees in Figures 10 through 13. Leaves plays the 
role of the head because it is the primary (most general) concept represented 
by the phrase. 


[Tree diagram: leaves → the, of, that; of → tree; tree → the; that → are; are → red] 


Figure 10: Dependency representation of “the leaves of the tree that are 
red” 


These trees represent both syntactic dependency and linear word 
order. Each node in the tree represents a word; the dependencies of that 
word are all daughter nodes. The root, at the top of the tree, is the head and 
is not a dependency of any other words within the context of these phrasal 
trees (namely, it does not qualify any other words). The word order can be 
recovered through an in-order traversal (begin with the left-most node and 
continue to the right). 
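

The recovery of word order by in-order traversal can be sketched as follows. The Node class and the linearize function are hypothetical names, and the attachments encoded below are one plausible dependency analysis of the phrase in Figure 10, stated here as an assumption. Each node keeps the dependents that precede it separately from those that follow it, so visiting left dependents, then the head, then right dependents reproduces the original string. 

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    word: str
    left: List["Node"] = field(default_factory=list)   # dependents that precede the head
    right: List["Node"] = field(default_factory=list)  # dependents that follow the head


def linearize(node):
    """In-order traversal: left dependents, then the head, then right dependents."""
    words = []
    for child in node.left:
        words.extend(linearize(child))
    words.append(node.word)
    for child in node.right:
        words.extend(linearize(child))
    return words


tree = Node(
    "leaves",
    left=[Node("the")],
    right=[
        Node("of", right=[Node("tree", left=[Node("the")])]),
        Node("that", right=[Node("are", right=[Node("red")])]),
    ],
)

print(" ".join(linearize(tree)))  # the leaves of the tree that are red
```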


¹ According to Hudson (2004), among others, phrases such as “the tree’s leaves” and “the green vase” are 
actually determiner phrases and not noun phrases, and are headed by determiners such as “the” or “a”. Though there 
is significant evidence for this, and most modern generative syntacticians prefer this analysis, we continue to use the 
noun phrase analysis throughout this paper due to its more intuitive simplicity and for its continued prevalence in 
computational linguistics. Furthermore, because we will end up ignoring determiners (see Appendix B.3), the 
distinction between determiner phrase and noun phrase analyses does not have an effect on the model we describe 
here. 
[Tree diagram: leaves → the, red, of; of → tree; tree → the] 


Figure 11: Dependency representation of “the red leaves of the tree” 


[Tree diagram: leaf → tree, red] 


Figure 12: Normalized dependency representation of “the tree’s red leaf” 


As noted above, the syntactic and lexical information depicted by 
these dependency trees is not sufficient to show that these four noun phrases 
are similar enough to represent the same concept. Structurally speaking, 
these trees are very different; only the root node (leaves) has the same 
position in every tree, and the remaining nodes are positioned almost 
everywhere in the tree. This suggests that we cannot use DG on its own to 
show how closely these sentences are related. However, in the following 
two sections we will show how we can adapt DG to construct a model of 
semantics that can be used to normalize noun phrases. 


[Tree diagram: leaf → tree, red] 


Figure 13: Normalized dependency representation of “the leaves of the tree 
that are red” 


B.2 The Semantics of Dependency 


Since we are trying to build a model of semantics, it is necessary for 
us to delve deeper into syntactic dependency in order to uncover the 
underlying semantics of noun phrases. 


Polguère and Mel’čuk (2009) describe syntactic and semantic 
dependency as separate but related concepts. They describe syntactic 
dependency as the bridge between the semantic form of a sentence (its 
meaning) and its morphological form (the linear string representation of 
morphemes that is spoken or written). That is, though syntactic 
dependencies have additional structure and complexity not apparent in 
semantic dependencies, the two structures are related. We may, in fact, be 
able to use the syntactic structure of a phrase to estimate its semantic 
structure, with additional understanding of the relationship between these 
two theoretical entities. 


Semantics can be described (to a certain extent) using predicate logic 
(Montague 1988). In terms of dependencies, a predicate’s dependents are 
its arguments. These arguments may themselves be predicates with 
additional dependents, allowing for recursive structures and complex 
sentences (Polguère and Mel’čuk 2009). This is an imperfect model for our 
purposes because it does not precisely correspond to syntactic dependency 
(which is, computationally speaking, easier to derive) and because it divides 
all morphemes into relationships (predicates) and entities (arguments). In 
semantics, it is unclear whether words such as a should be treated as 
predicates, arguments, or something else — in some cases, for example, they 
are treated as quantifiers (cf. Montague 1988). However, Polguère and 
Mel’čuk (2009) state that all words in a particular phrase must be connected 
in a semantic dependency structure, which allows for a more consistent 
representation. 


However, despite the disparity between syntactic and semantic 
dependency described above, there are some commonalities that are 
important to the development of the model described in this paper. One of 
the functions of a predicate is to add specificity to its argument(s). In this 
sense, we can finally observe a clear similarity between syntactic 
dependencies and semantics: the dependencies of a morpheme almost 
always add specificity to the meaning of that morpheme (green leaf is more 
specific than just leaf). 


Consider Figure 11. The phrase represented by this structure has the 
meaning “leaves” on a very general level. On a more specific level, it is 
clear that the leaves in question are possessed objects and that they are red. 
The possession relationship is made more specific by the dependencies of 
the ’s morpheme — the tree is the possessor of the leaf, and the tree is made 
more specific through the article the, which shows that the tree in question 
is a specific tree in the common ground between the speaker and the listener 
(or between the writer and reader). We will refer to this as a semantic 
specification relationship, because the parent node is made more specific 
through its daughters. 


Note that unlike semantic dependency as described in Polguère and 
Mel’čuk (2009), the specificity relationship always travels downward in the 
tree, and perfectly matches syntactic dependency. Semantic specificity is an 
ideal semantic model for our system, despite the fact that we still cannot 
directly explain the similarities between the four noun phrases in examples 
1 through 4, since the trees in Figures 10 and 11 are still different. In order 
to explain these similarities, we need to normalize our representations. 


B.3 Normalizing Similar Structures 


The primary differences between the four example phrases we have 
been discussing so far are found in the presence or absence of function 
morphemes — that is, grammatical units (such as the possessive morpheme 
’s) that do not correspond to any real-world meaning, but instead serve 
primarily to indicate grammatical relationships. Though these morphemes 
do affect the semantics of the overall phrase, they are primarily used to 
allow for different word-orders and slightly nuanced meanings. However, 
we can recover most of the meaning of a noun phrase without any of the 
function morphemes. 


We create the four “collapsed” trees in Figure 14 through Figure 17 
by removing all of the function morphemes, and leaving only content 
morphemes (grammatical units that do correspond to real-world meaning, 
i.e. some particular concept, action, or trait). 


[Tree diagram: leaf → tree, red] 


Figure 14: Collapsed representation of the tree’s red leaves 


[Tree diagram: leaf → red, tree] 


Figure 15: Collapsed representation of the tree’s leaves that are red 


[Tree diagram: leaf → red, tree] 


Figure 16: Collapsed representation of the leaves of the tree that are red 


[Tree diagram: leaf → red, tree] 


Figure 17: Collapsed representation of the red leaves of the tree 


At this point, the similarities between these four structures are quite 
clear: each one has leaf (the head of the noun phrase) as the root, and two 
dependencies: the words red and tree, albeit in different positions. 
However, in our semantic model, linear order does not impact the meaning 
of the noun phrase as a whole (only vertical order does). Thus, all four of 
these structures produce the same meaning, namely that of a leaf specified 
by both tree and red. Note that there is a trade-off to removing function 
words: we do not know what type of relationship there is between leaf and 
tree, only that a relationship exists, or that it is a generic relationship such 
as type-of (e.g. a tree-type-of-leaf). Relationships such as synonymy (words 
that are synonyms), meronymy (words that represent a part of another 
concept), and possession are not detected. However, we accept this trade- 
off for our purposes: we are looking for important concepts, not specific 
referents. 
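

The collapsing step can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the list FUNCTION_MORPHEMES, the toy lemma table LEMMAS, the function names, and the particular dependency attachments given for each phrase are all introduced here for the example. Trees are written as (word, daughters) pairs, function morphemes are dropped, and their content descendants are promoted to the nearest content ancestor. Daughters are stored in a frozenset because linear order does not affect meaning in the model. 

```python
FUNCTION_MORPHEMES = {"the", "of", "that", "are", "'s"}  # assumed list, for this example only
LEMMAS = {"leaves": "leaf"}  # toy lemma table (assumption)


def collapse(tree):
    """Collapse a (word, daughters) tree headed by a content morpheme."""
    word, daughters = tree
    return (LEMMAS.get(word, word), frozenset(content_daughters(daughters)))


def content_daughters(daughters):
    """Drop function morphemes and promote their content descendants."""
    kept = []
    for daughter in daughters:
        word, granddaughters = daughter
        if word in FUNCTION_MORPHEMES:
            kept.extend(content_daughters(granddaughters))
        else:
            kept.append(collapse(daughter))
    return kept


the_tree = ("tree", [("the", [])])
that_are_red = ("that", [("are", [("red", [])])])
phrases = {
    "the tree's red leaves": ("leaves", [("'s", [the_tree]), ("red", [])]),
    "the tree's leaves that are red": ("leaves", [("'s", [the_tree]), that_are_red]),
    "the leaves of the tree that are red": ("leaves", [("the", []), ("of", [the_tree]), that_are_red]),
    "the red leaves of the tree": ("leaves", [("the", []), ("red", []), ("of", [the_tree])]),
}

collapsed = {collapse(tree) for tree in phrases.values()}
print(len(collapsed))  # 1: the four phrases share one collapsed structure

# "the red tree's leaves": red specifies tree, not leaf, so the structure differs
red_tree = ("leaves", [("'s", [("tree", [("the", []), ("red", [])])])])
print(collapse(red_tree) in collapsed)  # False
```

Unlike the unordered set of words, the collapsed tree keeps vertical order, which is what separates the red tree’s leaves from the four example phrases. 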


B.4 Representing the Model 


The model of semantics described above can be represented 
graphically using tree structures such as those above, in Figure 11 through 
Figure 14. To do this in such a way that semantically similar phrases are all 
represented the same way, we need a consistent way to order daughter 
nodes. The method that we choose is largely irrelevant, so long as the order 
of daughter nodes is independent of the phrase’s original word order. A 
trivial method is to order the nodes alphabetically; this is what we will do 
for now, meaning that sentences 1 through 4 can all be represented using 
Figure 18. 
[Tree diagram: leaf → red, tree] 


Figure 18: Universal representation for sentences 1 through 4 
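

Alphabetical ordering of daughter nodes can be sketched as a recursive sort. The names canonical and to_text are hypothetical, and the bracketed text form printed at the end is only a convenience for this example; it is not the structured compound noun notation proposed in this paper. 

```python
def canonical(tree):
    """Return a copy of a (word, daughters) tree with daughters sorted at every level."""
    word, daughters = tree
    return (word, sorted(canonical(daughter) for daughter in daughters))


def to_text(tree):
    """Serialize a tree as head(daughter, daughter, ...); an ad hoc notation."""
    word, daughters = tree
    if not daughters:
        return word
    return word + "(" + ", ".join(to_text(daughter) for daughter in daughters) + ")"


# Two collapsed trees that differ only in the linear order of their daughters
tree_first = ("leaf", [("tree", []), ("red", [])])
red_first = ("leaf", [("red", []), ("tree", [])])

print(canonical(tree_first) == canonical(red_first))  # True
print(to_text(canonical(tree_first)))                 # leaf(red, tree)
```

Any ordering that is independent of the original word order would serve equally well; alphabetical order is used here only because it is the trivial choice named above. 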


However, this representation is not ideal. From a computational 
standpoint, it is acceptable: it represents unambiguously all of the 
information we need to know about a noun phrase, including the semantic 
information that we want for generating terminologies. These structures can 
be generated quickly and accurately using dependency parsers (Nivre 
2003). However, they are not easily read by humans without linguistic 
training. Reading a dependency tree requires an understanding of what a 
dependency is, as well as how dependency trees are structured. For domain 
taxonomies, data structures that are not human readable are undesirable. 
Terminologies need to be read by humans as well as by machines, adding 
an additional level of challenge to the problem of generating domain 
terminologies. 


We propose to solve this problem by building structured 
compound nouns in such a way that they represent syntactic dependency 
unambiguously while remaining human-readable. Linguistically speaking, 
a compound noun is a noun composed of two or more other nouns, such as 
dog house or airplane. A structured compound noun is a compound noun 
formed through the application of regular, systematic rules. In English, the 
way that two compound nouns are formed is not completely predictable. 
Though dog house and bird house refer to houses for dogs and birds, 
respectively, a fire house is not a house for fire in the same sense. Other 
languages, however, have more productive compounding, meaning the 
same way of creating a compound will have the same meaning in all cases. 
In Sanskrit and German, for example, compound nouns often (but not 
always) have predictable meanings based on their components and how 
they are combined. In other words, there is a set of rules that determines the 
meaning of a compound. We can adapt this idea to our semantic model in 
order to create an easy-to-read representation that follows from patterns in 
natural language. Because structured compound nouns are structured 
representations of phrasal semantics, they can be used to represent terms in 
a terminology. This produces a terminology of structured terms with 
predictable meanings based on roots and a set of rules used to combine 
them. 
Our root- and rule-based approach does not use the same rules or 
patterns as Sanskrit, German, or other languages with agglutinative noun 
compounds, as the rules in these languages still have many of the issues 
associated with natural language more generally, including ambiguities and 
inconsistencies. However, the processes of compounding and composition 
play a major role in root- and rule-based terminologies. 